Privacy compliant multiple dataset correlation system

ABSTRACT

A system and method for using inverse mathematical principles in the analysis of compatible datasets so that correlations and trends within and between said datasets can be uncovered. The present invention is tailored to the analysis of datasets that are extremely large; result from passive, privacy-secure, or anonymous data collection; and are relatively unbiased. Correlations and trends uncovered by such analysis can be further examined by data mining and prediction portions of the present invention, which uncover and make use of interrelated rules that determine data structures. An embodiment directed toward analysis of television viewership and marketing data that does this while still respecting privacy concerns is disclosed. In a preferred embodiment, a satellite, internet, cable, or other content provider can provide a viewer with a set-top box which may be specially instrumented to allow monitoring, recording, and transmission of set-top box events. While the analysis of television viewership and marketing data is presently preferred, it will be apparent to one skilled in the art that the system and method herein can be employed in other data collection and data analysis scenarios. Other contemplated embodiments include, but are not limited to, privacy-secure actuarial analysis, radio and Internet market data collection, and even consumer behavioral predictions for advanced marketing techniques.

REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims priority from Provisional U.S. Patent Application Ser. No. 60/176,177, filed Jan. 13, 2000, and the Provisional U.S. Patent Application is incorporated by reference in its entirety.

FIELD OF THE INVENTION

[0002] The present invention relates to the fields of data collection and data analysis. In particular, the present invention provides a system and method for privacy-secure data collection and correlation of such data with data from other sources.

BACKGROUND OF THE INVENTION

[0003] Advertisers tend to group prospective customers into broad demographic and geographic categories, possibly due to limitations in currently available market research methods with respect to determination of the effect of their advertisements. In addition, they use information gleaned from data mining to mass-market products to groups of prospective buyers. Unfortunately, the data searched during this data mining often contains low-validity information that is derived from small sample populations.

[0004] Due to these inherent data validity problems, statistics generated by such data mining may not accurately reflect a given market. That is, the statistics may not mean that all persons in a group will buy a product, but rather they imply that some person in a group may have a higher probability of buying the product than someone in another categorized group. For example, the data mining may show that more scuba equipment could be sold to 20-40 year-olds in Miami than to 50-80 year-olds in Kansas City.

[0005] Based on this data, advertisers carefully select the television shows, magazines, billboards, or other media on or in which their advertisements run. In the case of television, advertisers traditionally gravitate toward programs that garner higher ratings for desired audiences and then select advertising slots within those shows. Advertisers purchase ratings data from market research organizations, who collect and analyze data on the viewing habits of individuals and then publish the results.

[0006] Examples of such research organizations include A. C. Nielsen and Arbitron. Such companies typically monitor television-viewing habits of a relatively small number of viewers through telephone polls, specialized set-top monitoring “Nielsen” boxes, or viewer diaries. The results of these surveys are then extrapolated to the population at large.

[0007] As can be expected, extrapolation of small-population data to the population at large is prone to many different limitations, with accuracy perhaps the most notable. For example, if there were only 200 persons over 65 years of age in a sample, their compiled viewing behaviors may be purported to be representative of the viewing behaviors of the 35 million people in the U.S. over 65 years of age.

[0008] Obviously, larger, more random sample populations are preferred over smaller sample populations. This is true because a larger sample population tends to reduce the impact of suspect behaviors. Such suspect behavior might include distorted or inaccurate information provided in written television viewing logs, or intentionally leaving the television “on” to a certain channel to ensure higher ratings for a desired show even if the individual being sampled is not watching that show. If the behavior of even one of the 200 persons in the previous example was suspect, this may translate to errors in the predictions of approximately 175,000 people; if the sample population is increased to 50,000 people, an individual whose behavior was suspect would translate into prediction errors for only approximately 700 people. As advertisers continue to base their decisions on small-sample data, they are continuing to question whether their advertisements are reaching intended audiences.
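By way of illustration only, the figures quoted above follow from simple proportional extrapolation, in which each sampled person stands in for an equal share of the population. The function name below is hypothetical and is not part of the disclosed system; the population and sample sizes are those given in the text.

```python
def people_per_respondent(population: int, sample_size: int) -> float:
    """People represented by one sampled person under proportional extrapolation."""
    return population / sample_size


SENIORS_IN_US = 35_000_000  # population figure quoted in the text

# One suspect respondent distorts predictions for this many people.
small_sample_error = people_per_respondent(SENIORS_IN_US, 200)
large_sample_error = people_per_respondent(SENIORS_IN_US, 50_000)

print(small_sample_error)  # 175000.0
print(large_sample_error)  # 700.0
```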

[0009] While accuracy is certainly a big problem in the prior art, it is not the only problem. Another limitation is the specificity with which behaviors may be inferred as they pertain to specific demographic groups. For example, if only one of 200 sampled senior citizens is a single Asian with no dependents and has an annual income over $100,000, making an inference based on this more specific group is likely to be highly inaccurate; in many cases the behaviors of an entire demographic sub-group are attributed to the sampled behavior of only one person.

[0010] Another factor contributing to the inaccuracy of prior art is reliability. Invasive sampling methods such as those described above can cause many problems, including determining how much of the data can be trusted. Sampled individuals may not be willing to disclose, for example, that they watch adult (e.g., X-rated) programming or other controversial programming. Without such information, all data generated becomes unreliable.

[0011] Still another problem is that even if the sample data can be trusted, the memory of a sampled individual or the ability of a sampled individual to adhere to documented guidelines may not be accurate or complete. If a given individual is asked what they watched last week, the likelihood that the response will be correct and specific is low. Often, low response rates or missing journal information are extrapolated according to previously collected data and rules determined therefrom. However, this extrapolation is built on data generated through the inherently faulty means described above.

[0012] The invasive sampling techniques used in the prior art also suffer from an inherent flaw. Since these methods are invasive and participation is optional, differences between the types of persons who may be willing to be sampled and those that are not willing to be sampled may not be accounted for in such techniques.

[0013] While the effects of some of the problems in the prior art can be limited by increasing the population sample size, population sample size increases are typically cost prohibitive. The increased costs are the result of several factors, including equipment purchase, installation, and repair; data collection and validation; and participant compensation.

[0014] However, even when equipment and other costs are taken out of consideration and larger samples are collected, such an increase in sampling size does not solve all of the problems in the prior art. For example, the prior art also faces a problem with data resolution. Most major media research organizations consider data in an all-or-nothing fashion. For example, if a set of channels was watched during some sampling interval, only the channel that was watched the most, or the one watched at the time of the sample, would be counted, and it would be recorded as having been watched for the entire sampling interval (typically anywhere from 30 seconds to several hours). Although some in the prior art have attempted to mitigate this effect by sampling more frequently, there is always the possibility that changes occurring between samples will be missed. Thus, the use of data collection methods employed by the prior art tends to result in the generation of misleading or inaccurate viewing data.
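The resolution problem described above can be illustrated with a short sketch. The viewing segments, channel names, and function names below are hypothetical and chosen only for illustration; the sketch contrasts crediting an entire sampling interval to the most-watched channel with crediting each channel for the time actually watched.

```python
# Hypothetical viewing within one 30-second sampling interval:
# each entry is (channel, seconds watched).
segments = [("A", 8), ("B", 12), ("C", 10)]


def all_or_nothing(segments):
    """Credit the whole interval to the single most-watched channel."""
    total = sum(seconds for _, seconds in segments)
    winner = max(segments, key=lambda seg: seg[1])[0]
    return {winner: total}


def exact_credit(segments):
    """Credit each channel with the seconds actually watched."""
    credit = {}
    for channel, seconds in segments:
        credit[channel] = credit.get(channel, 0) + seconds
    return credit


sampled = all_or_nothing(segments)   # channels A and C disappear entirely
actual = exact_credit(segments)
```

In this sketch channel B is recorded as watched for all 30 seconds, although it accounted for only 12 of them.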

[0015] Data collected by media research organizations and inferences resulting therefrom face still another problem: one of substance. The fact that overlapping data is collected across different medium types (digital, written, verbal, etc.) makes the determination of common denominators difficult, and thus renders objective statistical mining impossible. Inferences drawn from such data may only be lateral in nature, and cannot be readily mined for trends. For example, while the data collected may support the conclusion that one show is more popular than another, the particular reason why one is more popular than the other cannot be extracted from this data. Such methods may be barely capable of supporting the most general popularity-type conclusions; any further analysis upon relationships of the conclusions is likely to be questionable at best, and accuracy may be lost each time more complicated, or deeper, inferences are drawn.

[0016] Unfortunately, there are many other problems with existing market research methodologies, such as the use of “Sweeps” or ratings periods, but most of these problems are at least partially statistically correctable. However, the five major issues discussed (accuracy, group specificity, reliability, resolution, and data substance) are inherent to actively monitoring data within small samples and cannot be overcome by the prior art.

SUMMARY OF THE INVENTION

[0017] The data collection techniques used in the prior art arise from a model developed in the 1960's and 1970's. At that time, market research data collection and data transmission costs were very high, and a system of periodic sampling was established. A thirty-second sampling window was chosen because a given household had an average of three channels available, and “surfing” was a non-existent phenomenon; thus, it was a safe assumption that the same channel was watched for the entire sampling period. Today, most television viewers have over 65 channels available to them, and they are barraged with more commercials per hour on each channel. This combination gives viewers incentive to frequently change channels within what would be a sampling period in the prior art.

[0018] Obviously, sampling methods from the 1970's are not capable of accurately representing television viewing habits in the year 2000 and beyond. Thus, a need exists to more accurately sample viewer data, ensure the data collected is not suspect, infer from such data relationships and trends in viewing habits, project sampled data more accurately to the population at large, and determine not only what shows, advertisements, or other content were watched, but also what portions of such content were watched. There also exists a need for ratings systems which can more accurately and objectively provide ratings of future programs. The data collection and data analysis aspects of the present invention can readily fulfill these needs.

[0019] A preferred embodiment of the present invention can provide advertisers with accurate ratings predictions of commercials and programs for specific demographic groups, rather than just providing overall ratings of programs which have already aired. While a preferred embodiment of the present invention involves television viewership statistics, the present invention can draw correlations between any dataset combinations, such as, but not limited to, television program or commercial viewership and sales figures, or sales figures and demographics. The present invention may provide advertisers with a better understanding of both consumer needs and their own advertising needs.

[0020] One aspect of a preferred embodiment of the present invention provides a system and method by which viewing behaviors of television viewers can be extracted electronically. While instrumentation and infrastructure development costs may be initially high, the present invention can allow data collection from a vast number of households without significant data collection, data storage, and data analysis cost increases as the number of households increases, or as the number of times data is collected per household is increased.

[0021] The present invention takes a different approach to television market data collection than the prior art. Rather than periodically sampling user behavior, the present invention tracks user behaviors by recording set-top box events. Such a set-top box may record events including, but not limited to, set-top box state changes, such as a set-top box being turned on or off, channel changes, volume changes, the use of an SAP feature, or muting of particular content; the use of interactive content guides; Internet web site usage; and combinations of such events. Recorded set-top box events data may be periodically transmitted to a central data collection point where data analysis may begin, or such transmission may occur instantaneously. In a preferred embodiment, this data collection method allows data to be gathered without requiring subjects to keep journals, push buttons, or even know their behaviors are being observed. This can be seen as an improvement over the prior art, as the invasive data collection methods used therein are likely to destroy data integrity.
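One possible representation of recorded set-top box events is sketched below. The record layout, field names, and event kinds are assumptions made for illustration and are not specified in this disclosure. The sketch shows how tuned time per channel might be derived from an event stream rather than from periodic samples.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxEvent:
    """Hypothetical event record; all field names are assumptions."""
    box_id: str
    timestamp: int      # seconds since an arbitrary start
    kind: str           # e.g. "power_on", "power_off", "channel", "mute"
    value: str = ""     # e.g. the channel tuned to


def seconds_per_channel(events):
    """Total tuned time per channel for one box, derived from its events."""
    totals = defaultdict(int)
    current, since = None, None
    for ev in sorted(events, key=lambda e: e.timestamp):
        # A channel change or power-off closes the open viewing span.
        if current is not None and ev.kind in ("channel", "power_off"):
            totals[current] += ev.timestamp - since
            current = None
        if ev.kind == "channel":
            current, since = ev.value, ev.timestamp
    return dict(totals)


log = [
    BoxEvent("box-1", 0, "power_on"),
    BoxEvent("box-1", 0, "channel", "5"),
    BoxEvent("box-1", 40, "channel", "7"),
    BoxEvent("box-1", 100, "power_off"),
]
durations = seconds_per_channel(log)
```

Because every state change is recorded, no viewing span between samples is lost, whatever its length.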

[0022] The present invention also includes a method for mining compatible datasets so that correlations and trends within and between the datasets can be uncovered. The present invention is tailored to the analysis of datasets that are extremely large; result from passive, privacy-secure data collection; and are relatively unbiased, such as datasets collected by set-top boxes described above. While the analysis of television marketing data is presently preferred, it will be apparent to one skilled in the art that the system and method herein can be employed in other data collection and data analysis scenarios. Other contemplated embodiments include, but are not limited to, privacy-secure actuarial analysis, radio and Internet market data collection, and thought and behavioral predictions for artificial intelligence efforts and governmental planning.

[0023] The data mining and prediction portions of the present invention attempt to uncover the interrelated rules that cause various data to arise, and a preferred embodiment does so while still respecting privacy concerns. Privacy can be maintained through anonymous data collection, which can be accomplished through a software upgrade to a standard set-top box. In a preferred embodiment, a satellite, cable, or other television provider (“cable company”) can provide a viewer with a set-top box which may be specially instrumented to allow monitoring, recording, and transmission of set-top box events, as described above.

[0024] Traditional set-top boxes include a unique identification number (“ID”), and this number can be used by a preferred embodiment of the present invention for identification purposes in lieu of personal information. To facilitate data analysis, a cable company can provide to the present invention a geographically associated code, such as, but not limited to, a zip code or telephone number prefix, that corresponds with each set-top box. Given this information, set-top box ID's to be monitored can be chosen through various means, including, but not limited to, the present invention selecting set-top box ID's at random, the present invention selecting set-top box ID's based on geographic coverage, a cable company selecting ID's based on its own criteria, or selecting all set-top box ID's. A combination of set-top box ID and geographically associated codes allows the present invention to maintain participant privacy while still allowing for determination of detailed demographic information through the inverse mathematical methods described herein.
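As a minimal sketch of one of the selection means listed above (random selection), the following assumes that the only information held for each box is its ID and a geographically associated code. The function name, the sample IDs, and the selection fraction are hypothetical.

```python
import random


def select_boxes_at_random(boxes, fraction, seed=None):
    """Pick a random subset of set-top box IDs to monitor.

    boxes maps a box ID to its geographic code (e.g. a zip code);
    no personal information is stored or needed.
    """
    rng = random.Random(seed)
    count = max(1, round(len(boxes) * fraction))
    chosen = rng.sample(sorted(boxes), count)
    return {box_id: boxes[box_id] for box_id in chosen}


# Hypothetical inventory supplied by a cable company.
inventory = {"STB-001": "33101", "STB-002": "33101",
             "STB-003": "64101", "STB-004": "64105"}
monitored = select_boxes_at_random(inventory, fraction=0.5, seed=7)
```

Selection by geographic coverage could be sketched similarly by sampling within each geographic code rather than across the whole inventory.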

[0025] Although privacy is an important part of the present invention, an alternative embodiment would allow set-top box operators to request a list of their viewing habits. This might be useful for parents or businesses wishing to monitor programs watched by their children or employees during a given day, or parents or businesses wishing to monitor other DATA1 datasets, such as internet viewing behaviors exhibited by employees or family members.

[0026] Analysis of data collected through a privacy-oriented approach such as the set-top box method described above is inherently self-limiting, as only viewership information for a particular show or commercial can be determined for a given time over the sample population. While this may be of interest to advertisers, an advertiser's real concern is that a show is reaching a particular target-market, and thus that they are spending their advertising money on shows which users of their products will watch. Thus, advertisers prefer detailed, grouped market research data, such as the ages, incomes, and other demographic information associated with a show's viewers. Through its novel data-correlation scheme, the preferred embodiment of the present invention can determine such information while still maintaining the anonymity of those being sampled. However, the present invention is not limited to providing only correlations between user behavior and demographic data; the present invention can draw correlations within and between any number of data sets with a common feature, such as the zip code of a given television viewer and the demographics associated with that zip code, regardless of the data represented therein.

[0027] In a preferred embodiment, the present invention augments television viewing behavior data collected from set-top boxes with relatively static data from outside sources, such as, but not limited to, demographic information from demographic providers, news information from news providers, weather information from weather providers, and sales information from advertisers, manufacturers and producers. This information can be used not only to increase the number of categories into which individuals may be grouped, but also to take into account specific confounding events, such as a severe weather alert, a national or regional news story, a local school's play-off game, and special promotional offers. Demographic, regional, and other such relatively static data may be updated at intervals specific to the type of data collected.

[0028] The present invention may make certain assumptions based on collected data to reduce data storage requirements. These assumptions are well-fitted with the use of matrix manipulation schemes. For example, the present invention may assume individual age is intrinsic to a person, household income is intrinsic to a household, and weather patterns are intrinsic to geographic regions. Thus, demographic information comprising a matrix need not be mutually exclusive. That is, if weather is a considered factor, weather data need not be collected and stored for every person in a geographic region, but simply can be held in one matrix that can be accessed for all people living within that geographic region.
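The storage assumption above can be sketched as follows: region-level data is held once per geographic region and looked up through each household's geographic code, rather than being copied into every household record. The dictionaries, keys, and values below are hypothetical.

```python
# Region-level data, stored once per geographic code (hypothetical values).
weather_by_zip = {"33101": "severe storm", "64101": "clear"}

# Household-level data carries only what is intrinsic to the household.
households = [
    {"box_id": "STB-001", "zip": "33101", "income_band": "50-75k"},
    {"box_id": "STB-002", "zip": "33101", "income_band": "25-50k"},
    {"box_id": "STB-003", "zip": "64101", "income_band": "75-100k"},
]


def weather_for(household):
    """Look up shared regional weather instead of storing it per household."""
    return weather_by_zip[household["zip"]]


observed = [weather_for(h) for h in households]
```

Three households are served here by two stored weather entries; the saving grows with the number of households per region.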

[0029] The present invention can draw inferences within and between data sets through a variety of means. In a preferred embodiment, the present invention may use inverse mathematical methods to perform the desired analyses. These methods can be more simply expressed as techniques of linear and matrix algebra.
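The passage above does not specify a particular formulation, so the following is only one way such an inverse problem might be posed, under stated assumptions: each row of A holds the known demographic mix of one zip code, b holds the viewing rate observed in that zip code, and the unknown x is the viewing rate of each demographic group. An ordinary least-squares solution via the normal equations is used here as a stand-in technique; the function name and all numbers are hypothetical.

```python
def solve_least_squares(A, b):
    """Solve min ||A x - b|| through the normal equations (A^T A) x = A^T b."""
    n = len(A[0])
    # Augmented normal-equation matrix [A^T A | A^T b].
    M = [
        [sum(row[i] * row[j] for row in A) for j in range(n)]
        + [sum(row[i] * bi for row, bi in zip(A, b))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Rows: three zip codes; columns: fraction of residents in group 1 and group 2.
A = [[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]]
# Observed share of boxes in each zip code tuned to a given program.
b = [0.16, 0.25, 0.34]

group_rates = solve_least_squares(A, b)  # approximately [0.1, 0.4]
```

In this sketch no individual is identified: group-level viewing rates are recovered from zip-level aggregates alone. The normal equations can be poorly conditioned when demographic mixes are nearly identical across zip codes, in which case a more robust solver would be preferable.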

[0030] With established data mining and data comparison methods in place, another step is to extrapolate any calculations to the rest of the country. Through the data-collection methods described above, the present invention can track viewing behaviors, demographic characteristics, and preferences associated with a geographically based group of people. If this same demographic information is obtained for the country as a whole, the present invention can use the data stored therein to extrapolate out its results to any region of the country, or to the country as a whole. The present invention includes, but is not limited to, the development of an extrapolation system, which itself involves the use of mined trends and will evolve and improve over time. For example, “white males age 70-80” in the non-sample population may be considered to have more similarities with all persons age 70-80 in the sample than with whites alone due to a dominance of the age factor in whites. Due to characteristics particular to a given geographic location, such as, but not limited to, such a location including a resort community, or a local sports team in a playoff game, some factors, such as age, race, and gender, or even entire geographic regions, may be ignored in extrapolation procedures.

[0031] While some in the prior art have attempted to provide statistics similar to those available through the present invention, none have done so at confidence levels approaching those provided by the present invention. In addition, the present invention improves over the prior art by allowing the extrapolation of very specific cases, even when no data exists for that specific case. For example, if no Single Asian Males, 23-24 years old, with partial custody of 1 child, one previous marriage, with a B.S. in Chemistry, working as an assistant in a Chemical Laboratory, having an income between $24,000-$27,000 per year, and living in a specific zip code in Miami, Fla. were in the sample, the present invention could still calculate an anticipated behavior based upon combinations of subsets of these characteristics and their observed influences.

[0032] In fact, not only can the present invention infer viewing preferences and other behaviors for previously aired content, but, as additional sample data is collected, the present invention can also predict reactions to future content. The present invention can characterize previously aired content, such as a television program or a commercial, based on specific attributes thereof, such as volume changes; color changes; changes in brightness or contrast; speed of motion; background music mood; content; genres; actors appearing on the screen; plotlines; languages spoken; use of foul or offensive language; and the like. This information can then be cross-referenced against viewer reaction to that content, and suggestions can be made to make the content more appealing to a particular audience. With a database of viewer reactions to previously aired content, the present invention can also be used to analyze proposed content before it is aired, or even to suggest optimal programming content structure and substance.

[0033] It should be obvious to one skilled in the art that the system and method described in this specification are not constrained by the same limitations as traditional data collection and analysis techniques. The present invention provides non-invasive sample data collection, significantly increasing reliability. The viewing habits of increasingly specific demographic groups can be ascertained while still maintaining high accuracy levels.

[0034] Additionally, the behavior data resolution is so fine as to allow the redefinition of television viewing behavior. For example, the exact percentage of each program that each group watched can be determined. In addition, the answer to “What percentage of Group A watched at least 80% of Program 1 who watched less than 10% of Program 2 three weeks ago?” is just as easy for the computers to determine as a seemingly simpler question. Furthermore, event data for every set-top box and geographic region can be archived as a large, unbiased database of 1's and 0's; therefore mining the data for trends would entail literally 0% loss of related-accuracy. This is to say that the data's substance is the same regardless of specificity of analysis.
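A sketch of how the example question above might be answered from per-box viewing fractions follows. It assumes one reading of the question (among Group A boxes that watched less than 10% of Program 2, the share that watched at least 80% of Program 1); the function name, box IDs, and fractions are hypothetical.

```python
def conditional_share(fractions, min_p1=0.8, max_p2=0.1):
    """Share of boxes with p1 >= min_p1 among those with p2 < max_p2.

    fractions maps a box ID to (fraction of Program 1 watched,
    fraction of Program 2 watched).
    """
    eligible = [p1 for p1, p2 in fractions.values() if p2 < max_p2]
    if not eligible:
        return 0.0
    return sum(1 for p1 in eligible if p1 >= min_p1) / len(eligible)


# Hypothetical Group A viewing fractions derived from event data.
group_a = {
    "STB-001": (0.90, 0.05),
    "STB-002": (0.85, 0.50),
    "STB-003": (0.20, 0.00),
    "STB-004": (1.00, 0.09),
}
share = conditional_share(group_a)  # 2 of the 3 eligible boxes qualify
```

Changing the thresholds changes only the filter, not the stored data, which is the sense in which a more specific question costs no more than a simpler one.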

[0035] It should also be obvious to one skilled in the art that the system and method described above can be used not only to rate television programs but also, unlike the prior art, to rate television advertisements. The prior art is limited to television program ratings because the sampling periods required to accurately rate television advertisements would result in more data than can be accurately collected, handled and characterized by the prior art. The novel data collection method described above provides viewer behavior data at a finer resolution than is possible through the sampling methods implemented by the prior art, and can thus be used to determine viewer behavior at any instant, including sections during a commercial.

[0036] Once data is collected, a further aspect of the present invention provides advertisers or others wishing to analyze the data with an interactive interface for such data analysis. Such an interface can allow data analysis requests to be entered through a variety of interfaces, such as through command-line queries, graphical interfaces, or even as natural-language questions. Further, an output representation may be selected, including, but not limited to, raw data, pie chart, time-progression, and the like. The present invention may further track frequently requested analyses and automatically update those on a periodic basis to expedite delivery of such information.

[0037] While the present invention improves over the prior art when evaluated using current advertising schemes, the present invention can also allow a new type of advertising. Rather than an advertiser purchasing time during a given show that is broadcast to a large audience, many of whom may not be in a product's target audience, the present invention may allow advertisements to be delivered to only those set-top boxes whose viewers exhibit certain behaviors or exhibit a propensity toward specific products or services. This allows advertisers to directly reach those viewers who would be interested in an advertiser's product or service, thus decreasing the cost per viewer of running such advertisements. A very simple example of such is continuing to show bicycle-related commercials to those who haven't turned the channel when bicycles have been shown in the past, or have recently bought or searched online for a bicycle.

[0038] Thus, it can be seen that the present invention represents significant improvements over the prior art. Not only can the present invention collect more reliable data through its use of private, non-invasive data collection techniques, but the present invention can also provide data which lends itself to more advanced, thorough, and privacy-secure analysis techniques. The present invention can also analyze data more accurately than data analysis and data mining techniques of the prior art. Further, the present invention allows data analyses to be performed on behaviors observed from a larger portion of the population than the prior art, and can more accurately extrapolate such data to the population in general.

BRIEF DESCRIPTION OF THE DRAWINGS

[0039] FIG. 1 is a block diagram providing a general overview of the consumer data acquisition, prediction, and query system of the present invention.

[0040] FIG. 2 is a block diagram of the market data acquisition, prediction, and query system of the present invention.

[0041] FIG. 3 is a block diagram of a Tuner Data Collection System of the present invention.

[0042] FIG. 4 is a block diagram of a Past Events Query System of the present invention.

[0043] FIG. 5 is a block diagram of a Graphic System of the present invention.

[0044] FIG. 6 is a block diagram of the Individual Behavior Determination System of the present invention.

[0045] FIG. 7 is a block diagram of the Future Events Query System of the present invention.

[0046] FIG. 8 is a block diagram of the Program Entry and Program Builder systems of the present invention.

[0047] FIG. 9 is a block diagram of the Data Mining and Prediction System of the present invention.

[0048] FIG. 10 is a block diagram illustrating aspects of a sample IDM Calculation Algorithm of FIGS. 2, 4, 5, 6 and 9.

[0049] FIG. 11 provides a high-level view of a technology infrastructure employed in a preferred embodiment of the present invention.

[0050] FIG. 12 is a graph of linear equation values and weights for a given geographic region.

[0051] FIG. 13 is an additional graph of linear equation values and weights for a given geographic region.

[0052] FIG. 14 is an alternative view of the graph of FIG. 13.

[0053] FIG. 15 is an alternative view of the graph of FIG. 14, and includes additional details.

[0054] FIG. 16 is a sample, single-peak graph of linear equation values and weights for a given geographic region.

[0055] FIG. 17 is a sample, two-peak graph of linear equation values and weights for a given geographic region.

[0056] FIG. 18 is a sample, random pattern graph of linear equation values and weights for a given geographic region.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0057] FIG. 1 is a block diagram providing a general overview of consumer data acquisition, prediction, and query systems of a preferred embodiment of the present invention and their interaction with each other. In this embodiment, the present invention may monitor user behavior while a user experiences television, radio, Internet, or other content. Examples of such content can include television shows, radio shows, music, advertisements, news, weather, and other multimedia or sensory-stimulating material.

[0058] Airings Data 110 comprises detailed content attributes. Examples of such content attributes include times at which such content was available; geographic or other regions to which such content was made available; actors or models appearing in or otherwise associated with such content; types of characters portrayed by such actors or models; content authors, producers, and directors; content genres, subjects, and settings; background music tones, tempo, and related characteristics; visual effect speed, colors, pixel change ratio, brightness, and other related characteristics; scents and tastes associated with such content; and other such content attributes. Additional content attributes stored by Airings Data 110 may include general plot themes or plot styles, such as comedic or dramatic segments; and a time- or position-based order in which such content attributes appear within such content.

[0059] As users view, listen to, or otherwise experience content, user behavior may be monitored through a set-top box, personal computer, radio, portable music player, or other device (“set-top box”). In a preferred embodiment, user behavior may be monitored by recording set-top box events. Tuner Data 120 may comprise a collection of such user behavior information. Tuner Data 120 may also comprise other user information, such as, but not limited to, billing information and personal demographic information.

[0060] Airings Data 110 and Tuner Data 120 may be transmitted to a data center of the present invention via a telecommunications infrastructure. In a preferred embodiment, such telecommunications infrastructure may include cable television systems, satellite television systems, telephone systems, or other wired or wireless telecommunications systems.

[0061] In addition to the above-described information, the present invention may include data from external sources, as indicated in FIG. 1 by Graphic Data 140. Graphic Data 140, also referred to as DATA2, may include demographic, geographic, sales, weather, and other information (“demographic information”). Geographic information used by the present invention may include, but is not limited to, distances between zip codes, population sizes within a zip code, and terrain types (coastal town, metropolitan area, etc.) within a zip code. Demographic information used by the present invention may include, but is not limited to, age, race, gender, and income distributions within a zip code or sub-zip code. In a preferred embodiment, such data may be of high enough resolution as to provide information at the zip code or sub-zip code level. The only requirement is that such data share a common aspect, such as zip code, with Tuner Data 120 or any other DATA1 data with which it may be correlated.
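The requirement of a shared aspect can be illustrated by joining two datasets on zip code. The dataset contents, field names, and function name below are hypothetical; the sketch keeps only those zip codes present in both datasets, since the others cannot be correlated.

```python
# Hypothetical DATA1-style viewing summary, keyed by zip code.
tuner_summary = {"33101": {"boxes_tuned": 120}, "64101": {"boxes_tuned": 45}}

# Hypothetical DATA2-style demographic summary, keyed by zip code.
graphic_summary = {"33101": {"median_age": 38}, "90210": {"median_age": 44}}


def join_on_zip(data1, data2):
    """Merge records from two datasets wherever they share a zip code."""
    shared = sorted(set(data1) & set(data2))
    return {z: {**data1[z], **data2[z]} for z in shared}


joined = join_on_zip(tuner_summary, graphic_summary)
```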

[0062] Data Center 130 represents a database or other data storagedevice in which data from Airings Data 110, Tuner Data 120, Graphic Data140, and the like can be stored. Data Center 130 may correlate datastored therein (such as correlating Airings Data 110 to Tuner Data 120),and such correlations may indicate content that was viewed, listened to,or otherwise experienced (“viewed”) by a user or group of users, and anyreactions thereto.

[0063] In a preferred embodiment, Algorithms 150 may use one or morestatistical methods to determine correlations among and between datastored in Data Center 130. Such correlations may include, for example,which persons and groups viewed certain content, and why. Past EventsQuery System 200 may allow users to extract meaningful and directedinformation from Data Center 130 using Algorithms 150 as describedabove. Past Events Query System 200 may focus on extraction ofprobabilities for events that have already taken place. These events caninclude, but are not limited to, past viewing habits of demographicgroups, past sales based on advertising, past sales and viewing based onweather, and the like.

[0064] In a preferred embodiment, Past Events Query System 200 may comprise a web-based system that allows customers to query a database of past viewing behaviors. A web-based system may utilize a natural language, graphical, or command-line input interface for such queries. The Past Events Query System 200 may allow a customer to query Data Center 130 while preventing the customer from obtaining any information about individual consumer behaviors and from duplicating processes employed in Algorithms 150 to produce such information.

[0065] As used in this specification, the term query can include any data-related question asked of the present invention. A query may concern specific content, content portions, content combinations, or mixtures thereof. By way of example, without intending to limit the present invention, a query may be, “How many African Americans in Florida, but not in Bradenton, watched at least 20% but no more than 80% of the primetime Friends last week, but did not watch at least 45% of the rerun of Seinfeld right before it or at the same time 3 weeks before that?”

[0066] While Past Events Query System 200 can generate statistics fortelevision airings or other events that happened in the past, thepresent invention is not limited to such queries. Customers, illustratedby Block 210, can also enter predictive queries through Future EventsQuery System 190. Future Events Query System 190 can, in turn, parsesuch requests into terms useable by Prediction System 180.

[0067] Prediction System 180 may predict future viewer behavior basedupon trends found in Data Center 130 and extracted using algorithms inAlgorithms 150. While these trends are of limited predictive use, suchtrends can be analyzed against demographic data specific to eachmonitored set-top box, thus providing better analysis of viewing trendsacross various demographic groups. Such analysis can be performed byIndividual Behavior Determination System 160. The addition of FutureAirings Data 170 allows further predictive refinement, as Future AiringsData 170 provides a basis onto which behaviors can be mapped orextrapolated.

[0068] Customers 210 of the system may access data that has beenanalyzed by Algorithms 150 through Past Events Query System 200 andFuture Events Query System 190. By these systems, a customer may tailorqueries so that Algorithms 150 or Prediction System 180 may answer them.Due to possible privacy concerns, Customers 210 may not have directaccess to Individual Behavior Determination System 160, thus restrictingaccess to behaviors of sampled households or persons.

[0069] FIG. 11 provides a high-level view of a technology infrastructure employed in a preferred embodiment of the present invention. As illustrated by FIG. 11, analog or digital set-top boxes (Blocks 1100 to 1102) reside in viewers' homes, and can control content presentation. Such control can include, but is not limited to, the selection of a television channel or increasing or decreasing volume. Such set-top boxes may also include software which provides additional set-top box functionality, such as, but not limited to, managing communications between a set-top box and a head-end (Block 1103), monitoring set-top box events, forwarding events to a head-end, and managing bandwidth utilization via configurable application parameters.

[0070] A head-end bunker can house equipment that distributes contentdownstream to a group of households. In a preferred embodiment, ahead-end bunker can also include a combination of hardware and softwarethat monitors user behavior information from downstream set-top boxes(Block 1105). In a preferred embodiment, such set-top box data can betransmitted to a head end through a cable television cable, telephoneline, or other telecommunications infrastructure. Such transmissions canalso occur through a cable shared with return path equipment, eventhough such equipment may be separate from distribution equipment.

[0071] In a preferred embodiment, a head-end bunker, as used in theprior art, may be enhanced with the addition of a UNIX-based server(Block 1106) that is connected to return path equipment via atelecommunications infrastructure. Such a server may allow collection ofuser behavior information.

[0072] A preferred embodiment of the present invention also provides aserver with access to information from a customer billing database(Block 1104). Such billing system access can provide correlationsbetween set-top boxes and customer data, such as billing zip code,billing area code and prefix, and the like. To address privacy issuesregarding viewership, a preferred embodiment of the present inventionwill identify set-top box data by zip code, area code and prefix, orother geographic identifier associated with a region in which a set-topbox resides. Correlations between set-top boxes and zip codes can bemaintained in a cable television or other content provider's billingsystem; thus, access to such billing data may be preferred.

[0073] A highly available and highly reliable server is preferred for set-top box event monitoring, as such a device may reside in a rack at a head-end bunker, and head-end bunkers may be physically disparate or in remote regions. A preferred embodiment of Server 1106 includes a UNIX-based server; a UNIX-based server is preferred as such servers may reduce maintenance requirements. In addition, backup circuits may be implemented to provide fault tolerance depending on availability requirements for gathered data.

[0074] Server 1106 can also attach to a network access device (Block1107) to upload data gathered from set-top boxes to a data center. Suchnetwork access devices can include, but are not limited to, modems,routers, and satellite transceivers. As illustrated in FIG. 11, aprivate network link (Block 1108) is preferred for connecting a serverto a data center for data uploads, as well as network and systemsmanagement, and for other functions. However, such functions may also beaccomplished across a shared network, such as the Internet. Datatransmitted across public or private networks may be encrypted orotherwise encoded to reduce the likelihood that such data may be used byunauthorized individuals.

[0075] Data uploads may occur in real-time or data may be temporarily stored on a server and transmitted to a data center on a periodic basis. Such periods may be time based, or may be based on the occurrence of an event, such as, but not limited to, receipt of a certain quantity of data or data from a particular set-top box.
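
By way of illustration only, and without intending to limit the present invention, the upload triggers described above might be sketched as follows. The class name `UploadBuffer`, its fields, and the default thresholds are hypothetical and do not appear in this specification; the sketch merely shows a buffer that releases a batch when a time period elapses, when a quantity of data accumulates, or when data arrives from a designated set-top box.

```python
from dataclasses import dataclass, field


@dataclass
class UploadBuffer:
    """Hypothetical head-end buffer that decides when to upload events."""
    max_events: int = 1000             # assumed quantity-of-data trigger
    max_age_seconds: float = 3600.0    # assumed time-based period
    priority_boxes: frozenset = frozenset()  # boxes whose data forces an upload
    events: list = field(default_factory=list)
    last_upload: float = 0.0

    def add(self, box_id, event, now):
        """Store one event; return the batch to upload, or None to keep buffering."""
        self.events.append((box_id, event, now))
        due = (
            len(self.events) >= self.max_events
            or now - self.last_upload >= self.max_age_seconds
            or box_id in self.priority_boxes
        )
        if not due:
            return None
        batch, self.events = self.events, []
        self.last_upload = now
        return batch
```

A real-time upload corresponds to the degenerate case in which every call to `add` returns a batch, for example by setting `max_events` to 1.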

[0076] In a preferred embodiment, data transmitted by Server 1106 may bereceived at a data center. Such a data center may be a centralrepository for all data gathered from a plurality of head-ends. FIG. 11includes an illustration of major data center components.

[0077] Data from Server 1106 may come into a data center through widearea circuits (Block 1109) and into temporary storage space (Block1110). Any data cleansing or pre-processing prior to import of such datainto the main database can be accomplished as data is stored thereon.Pre-processed data may then be imported into a main data store (Block1114).

[0078] In addition to user behavior data, a data center's main databasemay also store or access data from sources external to the presentinvention. Such external information may include, but is not limited to,content attributes from various content providers (Block 1111),demographic information from third party providers (Block 1112), andsales data from retailers or producers.

[0079] As illustrated in FIG. 11, data stored in a main data store mayalso be replicated to one or more databases for other purposes. In apreferred embodiment, data may be replicated to a database that isdedicated to Internet access (Block 1113), and another database that isdedicated to report generation (Block 1115). Such replication mayprovide data security, as data stored in one database can be comparedagainst data stored in other databases to ensure its authenticity. Datastorage and retention properties can also be adjusted for each server asneeded.

[0080] By way of example, without intending to limit the presentinvention, Internet Database 1113 can be configured to provide dedicatedaccess to a time-limited amount of viewership data. A web-basedapplication (Block 1117) can then provide customers with access to datain Internet Database 1113, and can also analyze and report on such data.Web servers (Blocks 1121 through 1123) can provide a front-end querysystem for customizing such analyses and viewing reports. In a preferredembodiment, fields and data made available through Internet Database1113 will also be structured to ensure that queries complete in areasonable period and that impact on other users is controlled.

[0081] As an alternative example, again without intending to limit thepresent invention, data replicated to a reporting database (Block 1116)can be used to create hard copy reports (Block 1118), electronic reports(Block 1119), and CD-ROM's (Block 1120) for customers who request accessto data by means other than through a Web interface. A reportingdatabase may also have time-limited data retention.

[0082] FIG. 2 is a detailed block diagram of market data acquisition, prediction, and query systems of a preferred embodiment of the present invention. Although FIG. 2 includes language specific to this preferred embodiment, the principles of the present invention are also illustrated there, and can be seen with respect to any arbitrary data by replacing ‘Tuner Data Center’ by ‘DATA1’, ‘Interval Updating Graphic Database’ by ‘DATA2’, and ‘Sales Data’ by ‘DATA3’.

[0083] While a preferred embodiment of the present invention applies theconcepts of the present invention to a television ratings system, thepresent invention has other applications as well. Such applicationsinclude, but are not limited to, Internet advertising and actuarialanalysis in the insurance industry. Whenever multiple datasets exist andcorrelations are desired between such datasets, the present inventioncan draw such correlations provided at least one dataset is relativelystatic and a common aspect, such as a zip code, is shared between thedatasets. In the preferred embodiment disclosed in this specification,DATA1 can represent a variable data set, such as data from a set-topbox, and DATA2 may represent a relatively static dataset, such asdemographic data for a given geographic region. The system and methoddescribed herein can determine correlations between such datasetswithout direct knowledge of DATA2 values for DATA1 data.

[0084] FIGS. 3 through 9 illustrate individual modules of the present invention, and FIG. 2 illustrates the modules of FIGS. 3 through 9 overlaid atop one another and interconnected, thereby illustrating interoperability of various modules and relationships between such modules. Elements of FIG. 2 will be described below in connection with FIGS. 3 through 9.

[0085] FIG. 3 is a block diagram of a Tuner Data Collection component of the present invention. In the more general functionality provided by the present invention, FIG. 3 illustrates the acquisition of some data, DATA1. As it relates to a preferred embodiment, FIG. 3 is a flow chart of modules useful in moving tuner data to Data Center 130 of FIG. 1.

[0086] Set-Top Boxes 310 can comprise one or more set-top boxes, whichcan be located in one or more households. Set-Top Boxes 310 may collectand record event information based on behavior of one or more sampledusers, as well as embedded content attributes, where such attributes areavailable. Such embedded content attributes can allow the presentinvention to quickly match data about specific content to set-top boxevents, rather than pulling such attributes from external data sources.Embedded content attributes may pertain to content with which saidattributes are transmitted, or embedded content attributes may pertainto previously presented content or content to be made available in thefuture.

[0087] Set-Top Boxes 310 may constantly transmit state-changeinformation to HeadEnd Bunker 320, or Set-Top Boxes 310 may send batchesof state-change information to HeadEnd Bunker 320. HeadEnd Bunker 320can forward such state-change information, along with content attributes(Block 330), to Sorting System 900.

[0088] Sorting System 900 may comprise one or more sorting algorithmsthat place set-top box event data into efficient arrays. Due toreliability issues associated with data from set-top boxes operated by aconsumer who knows he or she is being observed, these sorting algorithmsmay separate data into two or more classes based on whether a set-topbox owner or operator has specifically requested access to monitoreddata, or is otherwise aware that they may be monitored.

[0089] In a preferred embodiment, the present invention may not collectindividual-specific information, such as name, size of family, name ortype of business, address, and the like from sampled users, with theexception of a zip code, area code and prefix, or other geographicidentifier. Local Cable Provider 340 or other entity, which may beacting as a privacy guard for a sampled population or a governmentalagency, may also provide these geographic identifiers.

[0090] While data acquired by HeadEnd Bunker 320 may contain embedded content attributes, not all content may be so encoded. Non-embedded Program Information 350 may be acquired from a content airing source, such as Local Cable Provider 340, or possibly other sources, such as Internet-based guides. Non-embedded Program Information 350 may comprise information that identifies content for which attributes are not available. Where such data is not electronically available, employees may make phone calls, consult published guides, or otherwise obtain such data through manual methods.

[0091] These latter methods and data collected thereby are illustratedin FIG. 3 as Airings Source 360. Airings Source 360 may also include alist of content that may be available at a time in the future. Inaddition, Production Team 370 may work with content creators to providecontent to Local Cable Provider 340 and provide Non-Embedded ProgramInformation 350 to Airings Source 360. Production Team 370 can includean organization working with content creators who have access to programdetails and airings times for Non-Embedded Program Information 350. Inits representation in the figures, Production Team 370 may refer to anyunit or process involved in content creation or distribution, such aswriters, producers, studios, networks, and the like.

[0092] As Sorting System 900 receives such data, appropriate sorting mayoccur and correlations may be drawn between such data and data fromBlock 330. Sorted data may be stored in Tuner Data Center 930. TunerData Center 930 may comprise a database of set-top box data arrays ofrelevant age. Arrays of information of non-relevant age may be storedoff-line, but may still be permanently accessible.

[0093] FIG. 4 is a block diagram of Past Events Query System 200 of FIG. 1. Although illustrative of a preferred embodiment of the present invention, FIG. 4 also illustrates the general concepts of the present invention with respect to any arbitrary DATA1 and DATA2 if ‘Tuner Data Center 930’ is replaced by ‘DATA1’ and ‘Interval Updating Graphic Database 620’ is replaced by ‘DATA2.’

[0094] Graphic Vendor 610 may comprise one or more data vendorssupplying the present invention with demographic data for a geographicor other region. Graphic Vendor 610 may provide such information brokeninto distribution units, where such distribution units share a commonfactor such as zip code or sub-zip code.

[0095] While Graphic Vendor 610 may supply such data in a preferredembodiment, an alternative embodiment can replace data supplied byGraphic Vendor 610 with data determined internally by a Graphic Systemas illustrated by FIG. 5. A Graphic System can use data collected by anIndividual Behavior System, which is illustrated in FIG. 6, to providenecessary data. In another embodiment, data from a Graphic System can beaugmented by data from Graphic Vendor 610 to provide the presentinvention with more comprehensive data.

[0096] In a preferred embodiment, the present invention may periodicallyrequest data from Graphic Vendor 610 and such data can be stored inInterval-Updating Graphic Database 620. As Interval-Updating GraphicDatabase 620 receives data from Graphic Vendor 610, data stored inInterval-Updating Graphic Database 620 may be modified to reflectchanges implied by data from Graphic Vendor 610. Interval-UpdatingGraphic Database 620 may then be used by the present invention as asource of graphic data.

[0097] The present invention may use data from Interval-Updating GraphicDatabase 620 to create data arrays representing relative datadistribution percentages. Such arrays can be compiled by IDGM GraphicMatrix 910 at any time. In a preferred embodiment, IDGM Graphic Matrix910 may be updated when new data is received by Interval-UpdatingGraphic Database 620.

[0098] IDGM Graphic Matrix 910 may create matrices for each set of graphic data. Such matrices may contain arrays that refer to zip codes or other geographic descriptors to which information contained within an array corresponds. In a preferred embodiment, arrays may be formed with column headings corresponding to graphic characteristics, such as, but not limited to, gender or age, and rows corresponding to a set of zip codes. A number corresponding to the percentage of said row that can be attributed to said column may be stored in the intersection of each row and column. Thus, for example, if 65 percent of the population of a particular zip code were male, 0.65 could be stored in the intersection of the male column and the row corresponding to said zip code. Such arrays can then be used by a Process Computer for matrix operations that provide numerical data to a report processor prior to delivery to a customer.
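
By way of illustration only, an array of the kind described above might be assembled as in the following sketch. The function name `build_share_matrix`, the zip codes, and the population counts are hypothetical; the sketch also assumes that the supplied columns all belong to a single category (here, gender), so that the values in each row sum to one.

```python
def build_share_matrix(counts_by_zip, columns):
    """Return (zip_codes, rows), where rows[r][c] is the share of zip r in column c."""
    zip_codes = sorted(counts_by_zip)
    rows = []
    for z in zip_codes:
        counts = counts_by_zip[z]
        total = sum(counts[c] for c in columns)
        # Store fractions (0.65 rather than 65) as in the example in the text.
        rows.append([counts[c] / total if total else 0.0 for c in columns])
    return zip_codes, rows


# Hypothetical population counts for two zip codes.
counts = {
    "33601": {"male": 6500, "female": 3500},
    "34205": {"male": 4800, "female": 5200},
}
zips, matrix = build_share_matrix(counts, ["male", "female"])
# The "male" entry for zip code 33601 is 0.65, matching the 65 percent example.
```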

[0099] Customers, illustrated in FIG. 4 by Market Customer 530, mayrequest such reports from Past Events Query System 200. The presentinvention may translate such a request into a mathematical formula, or amachine-language representation of such, through Post-Translation System280. Formulae created by Post-Translation System 280 may be interpretedby IDGM Calculation Algorithm 270 to properly extract and analyze datastored in IDGM Graphic Matrix 910 and Interval-Updating Graphic Database620.

[0100] FIG. 10 provides an overview of a sample IDGM Calculation Algorithm 270 that can perform such analyses. As illustrated by Block 1012, algorithms used by IDGM Calculation Algorithm 270 may take advantage of an assumption used by media researchers, which is that each member of a given DATA2 (Block 1011) group has the same probability of exhibiting some DATA1 (Block 1010) behavior as any other member of the same group. The present invention extrapolates from this an assumption that probabilities associated with behaviors of groups of people can be determined by their demographic specification (this is referred to as the “demographic assumption”). IDGM Calculation Algorithm 270 uses calculations derived from these assumptions to develop DATA2 correlations for DATA1 data without collecting DATA2 information about DATA1 data directly, and to determine confidence intervals associated with such correlations.

[0101] IDGM Calculation Algorithm 270 may use inverse mathematicalprinciples to find such correlations (Blocks 1013 and 1014). Thefollowing are two methods through which such correlations may bedetermined by IDGM Calculation Algorithm 270. While these examples areprovided for enablement and best mode purposes, these examples shouldnot be construed as limiting the present invention. In alternativeembodiments, subsets of these examples may be used, as may additionalcalculation methods.

[0102] A preferred embodiment of the present invention assumes that aperson's demographic description has some influence on his choice oftelevision viewing. A goal of this embodiment of IDGM CalculationAlgorithm 270 is to apply the inverse of this assumption; that is, aperson's demographics can be determined from his viewing habits. Toachieve this, the present invention can invert region-specific viewingand demographic data to compute demographic-specific viewinginformation.

[0103] The following are definitions that will aid in understanding this embodiment:

[0104] Demographic Data—As with this specification as a whole, the term“demographic data” includes ordinary demographic categories as well asgeographic variables and general local characteristics. Examples ofordinary demographic categories include, but are not limited to, age,race, gender, income, education level, marital status, and number ofdependents. Examples of geographic variables include, but are notlimited to, climate and weather, urban or rural environment, coastal orinland geography, population density, and amount of traffic. Generallocal characteristics may include, but are not limited to, progress ofregional sports teams and local news events.

[0105] Demographic Characterization—A demographic characterization is a set of values for each of a given set of demographic categories.

[0106] Demographic Characterization Level—A demographic characterization level is the number of categories comprising a demographic characterization. For example, a level-one characterization might be a specification of race, while a level-two characterization might be an age group together with a race.

[0107] Demographic Specification—A demographic specification is a full demographic characterization that uses all demographic categories tracked by the present invention.

[0108] Demographic Aspect—Demographic aspects are potential demographic category values. For example, the category “gender” has aspects male and female.

[0109] Orthogonal Characterizations—A set of demographic characterizations is said to be orthogonal if there is no overlap among them; that is, a given person can fit into no more than one of them.

[0110] Complete Characterizations—A set of characterizations is said to be complete if any given individual person necessarily falls in at least one of the characterizations in the set.

[0111] Orthogonal and Complete Characterizations—A set of characterizations is said to be orthogonal and complete if any given individual falls into exactly one characterization.
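
By way of illustration only, the orthogonality and completeness definitions above might be checked as in the following sketch. The function name `classify_set` and the sample records are hypothetical. Each characterization is modeled as a predicate over a person record, and the check is made only against a finite sample of people, whereas the definitions above concern any possible individual.

```python
def classify_set(characterizations, people):
    """Return (orthogonal, complete) for predicates tested against sample people."""
    # For each person, count how many characterizations that person fits.
    matches = [sum(1 for fits in characterizations if fits(p)) for p in people]
    orthogonal = all(m <= 1 for m in matches)  # nobody fits more than one
    complete = all(m >= 1 for m in matches)    # everybody fits at least one
    return orthogonal, complete


# Hypothetical sample of people, described only by age.
sample = [{"age": 10}, {"age": 25}, {"age": 40}]

under_30 = lambda p: p["age"] < 30
at_least_30 = lambda p: p["age"] >= 30
over_20 = lambda p: p["age"] > 20

both = classify_set([under_30, at_least_30], sample)   # orthogonal and complete
overlapping = classify_set([under_30, over_20], sample)  # complete, not orthogonal
partial = classify_set([under_30], sample)              # orthogonal, not complete
```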

[0112] STB—A set-top box.

[0113] Program State—Program states reflect particular content, orportions and combinations of content, presented by an STB at aparticular time. For example, a program state could be defined as somespecific 10-second interval of a particular commercial which just aired,combined with an entire program that aired 2 weeks ago.

[0114] Tuner State—Tuner states represent a current STB state.

[0115] Event Rating—An event rating represents a probability that aperson or group of people, represented by a demographiccharacterization, matched or will match an STB event.

[0116] ERINRating—An erinRating is produced for each event rating byprorating all event ratings for a given geographic area and over a giventime period with a given set of program state choices. The number ofpersons in a geographic area exhibiting an event can be determined bymultiplying the number of persons matching a demographiccharacterization in a given area by an associated erinRating.

[0117] STB Event Time—STB event time is a time series that is defined bySTB event sampling. For example, if a cable provider's system isinoperable for 10 hours, this gap is not considered in STB event time.

[0118] A goal of the present invention is to determine the extent towhich the demographic assumption is valid. This determination can be afactor in calculating confidence intervals for data resulting from thepresent invention.

[0119] The demographic assumption can be expressed mathematically in the following relation:

$$m_k = \sum_i p_{ki}\, v_i \qquad \text{(Equation 1)}$$

[0120] Here m_(k) is an observed number of STB's in zip code k experiencing some defined event, p_(ki) is the number of people in zip code k with demographic characterization i, v_(i) is the fraction of people of characterization i that are watching the event, and the sum is over a complete set of demographic characterizations i. This formula embodies the demographic assumption because v_(i) depends only on i and not on the zip code k. In plain terms, Equation 1 simply says that the total number of STB's experiencing a particular event can be determined by summing the number of people of each demographic characterization that are experiencing the event.
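
By way of illustration only, Equation 1 might be evaluated as in the following sketch. The function name `expected_counts` and all numerical values are hypothetical. Rows of `p` correspond to zip codes k and columns to demographic characterizations i.

```python
def expected_counts(p, v):
    """Equation 1: m_k = sum over i of p[k][i] * v[i], for every zip code k."""
    return [sum(p_ki * v_i for p_ki, v_i in zip(row, v)) for row in p]


# Hypothetical inputs: two zip codes (rows) and two characterizations (columns).
p = [
    [400, 600],
    [700, 300],
]
v = [0.10, 0.05]  # assumed fraction of each characterization watching the event

m = expected_counts(p, v)
# m holds roughly 70 for the first zip code (40 + 30) and 85 for the second (70 + 15).
```

The same viewing fractions `v` are applied to every row, which is the demographic assumption expressed in code.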

[0121] In a more general application of the present invention and its related formulas, m_(k) can be seen as corresponding to DATA1, and p_(ki) to DATA2.

[0122] Given that m_(k) values can be determined through the present invention, and p_(ki) values can be obtained from demographics vendors or by other means, the present invention can use Equation 1 to solve for v_(i). This can be accomplished by defining an error function such as:

$$\psi^2 = \sum_k \sum_i \left(m_k - p_{ki}\, v_i\right)^2 \qquad \text{(Equation 2)}$$

[0123] If a dataset under consideration contains more zip codes than demographic characterizations, Equation 2 can be solved through a standard least-squares approach. If a dataset under consideration contains fewer zip codes than demographic characterizations, fitting methods may be applied prior to application of a standard least-squares approach. A least-squares approach can involve inverting a matrix based on p_(ki), and for that reason is referred to as an inverse demographic matrix, or IDM, solution. This is illustrated by Block 1014 in FIG. 10.
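
By way of illustration only, a standard least-squares solution of the kind referred to above might be sketched as follows. This sketch rests on an assumption that is not stated in this form in the specification: the residual for zip code k is taken to be m_(k) minus the full sum over i of p_(ki) v_(i), as Equation 1 implies, and the sum of squared residuals over k is minimized by solving the normal equations. The function name `solve_idm`, the singularity tolerance, and the sample values are hypothetical.

```python
def solve_idm(p, m):
    """Least-squares v minimizing sum_k (m[k] - sum_i p[k][i] * v[i]) ** 2.

    Solves the normal equations (P^T P) v = P^T m by Gauss-Jordan elimination
    with partial pivoting. Needs at least as many zip codes (rows) as
    characterizations (columns) and linearly independent columns.
    """
    n = len(p[0])
    # Augmented matrix [P^T P | P^T m].
    a = [
        [sum(row[i] * row[j] for row in p) for j in range(n)]
        + [sum(row[i] * m_k for row, m_k in zip(p, m))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:  # assumed tolerance for a singular system
            raise ValueError("p matrix is rank deficient; apply a fitting method first")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Hypothetical data: three zip codes, two characterizations.
p = [[400, 600], [700, 300], [500, 500]]
m = [70, 85, 75]
v = solve_idm(p, m)  # close to [0.10, 0.05] for these consistent inputs
```

The raised error corresponds to the case, noted above, in which too few independent zip codes are available and some fitting method would be needed before a least-squares approach is applied.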

[0124] An IDM can be implemented in different manners, depending on the set of demographic characterizations. For example, if a particular query involves only one category, such as age, then demographic characterizations can be defined across the whole set of age-intervals (0-10, 11-20, 21-30, . . . , 101-110, 111+, for example). In this case, a set of twelve age-intervals forms a complete (all individuals fall into at least one interval), orthogonal (an individual falls into no more than one interval) set of characterizations. A resulting p_(ki) matrix will then be a matrix of size N_(zip)×12, where N_(zip) is the number of zip codes used in the calculation. Equation 2 can then be solved for twelve values of v_(i), v₁ through v₁₂.

[0125] Alternatively, if a query involves only one particular age-interval, such as ages 11 to 20, a set of demographic characterizations with only two elements can be used, one containing people between 11 and 20, and one containing all others. The resulting set is also complete and orthogonal, and can be represented in a p_(ki) matrix. In this case, the p_(ki) matrix is N_(zip)×2, and Equation 2 can be solved for two values of v_(i), v₁ (the fraction of people in the age range 11 to 20 who are watching), and v₂ (the fraction of people of all other ages who are watching).

[0126] An IDM solution for a query involving a set of level-n characterizations may be referred to as IDMn. The age-group example given above, using an N_(zip)×12 p_(ki) matrix, can thus be seen as an example of IDM1. If a query involves two categories, for example age group and gender, then a complete, orthogonal, level-2 demographic characterization set would have 24 elements, and an IDM2 solution would involve a p_(ki) matrix of size N_(zip)×24.
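
By way of illustration only, the sizes of the level-1 and level-2 characterization sets discussed above can be reproduced by taking the Cartesian product of the aspects of each category. The names `AGE_GROUPS`, `GENDERS`, and `characterization_set` are hypothetical; the twelve age intervals follow the example intervals given earlier in this description.

```python
from itertools import product

# Twelve age intervals: 0-10, then 11-20 through 101-110, then 111+.
AGE_GROUPS = ["0-10"] + [f"{lo}-{lo + 9}" for lo in range(11, 111, 10)] + ["111+"]
GENDERS = ["male", "female"]


def characterization_set(*categories):
    """Every combination of one aspect per category, as a list of tuples."""
    return list(product(*categories))


level1 = characterization_set(AGE_GROUPS)           # 12 columns for IDM1
level2 = characterization_set(AGE_GROUPS, GENDERS)  # 24 columns for IDM2
```

Because each person has exactly one aspect per category, a set built in this manner is orthogonal and complete, provided the aspects of each individual category are themselves orthogonal and complete.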

[0127] When a query involves only a single characterization, such as ages between 11 and 20, this may be referred to as IDMn-P. Thus, the earlier example involving only ages 11 to 20 is an IDM1-P solution. As another example, if a query only involved women in age interval 21 to 30, then an IDM2-P solution, for which the p_(ki) matrix would again be of size N_(zip)×2, could be used.
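
By way of illustration only, the N_(zip)×2 matrix used by an IDMn-P solution might be derived from a full p_(ki) matrix as in the following sketch. The function name `collapse_to_partial` and the sample counts are hypothetical. The sketch assumes that the columns of the full matrix form an orthogonal and complete set, so that each row sums to the population of the zip code.

```python
def collapse_to_partial(p, target_columns):
    """Reduce an N_zip x C matrix to N_zip x 2: [target group, all others]."""
    targets = set(target_columns)
    collapsed = []
    for row in p:
        inside = sum(x for i, x in enumerate(row) if i in targets)
        collapsed.append([inside, sum(row) - inside])
    return collapsed


# Hypothetical counts for two zip codes and three age intervals.
full = [
    [100, 200, 300],
    [50, 50, 100],
]
partial = collapse_to_partial(full, [1])  # query concerns only the second interval
# partial == [[200, 400], [50, 150]]
```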

[0128] IDM solutions can be normalized with complementary IDMn-P solutions to supplementary IDM(1, 2, 3 . . . n)-P solutions. Complementary normalization involves IDMn-P computation of all characterizations of the same “n” which contribute to a whole IDM(n−1) mutually exclusive demographic characterization. IDMn-P values can then be normalized to total the value computed for the IDM(n−1)-P characterization (which was previously normalized itself if n>1). All normalization can begin at characterizations of lowest n value. By way of example, when n equals 7, characterizations 1 through 6 should be calculated, so that normalization occurs between all two-level combinations.

[0129] While the present invention may calculate an IDM solution, such a solution may not be presented to a customer querying the present invention. Rather, the present invention may present a range of values that fall within a particular level of statistical confidence. Even as STB usage expands to include the majority of the population, such value ranges may have a non-zero size due to a violation of the demographic assumption. Values for such ranges can be determined by uncertainty estimations provided by a least-squares fit.

[0130] While STB's remain well below complete penetration, even when data is sampled in a random fashion, value ranges returned as a result of a query can have an additional uncertainty contributed by sampling error. In addition, sample bias, which can occur, for example, when a sampled individual knows of such sampling, or simply as a result of a difference between those willing to be sampled and those unwilling to be sampled, can cause additional sampling error. The methods outlined below can address these complicating issues, and can calculate error ranges for various datasets.

[0131] As data is collected by the present invention, a 2-dimensional array holding the number of matching events between all combinations of demographic characterizations is kept in a specification similarity matrix, illustrated by Block 1026 in FIG. 10. For each demographic characterization combination, the Pearson-r correlation can be computed with respect to all pre-defined events.
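
By way of illustration only, a Pearson-r correlation between two demographic characterizations might be computed as in the following sketch. The specification does not state exactly which series are correlated; this sketch assumes each characterization is represented by one value per pre-defined event. The function name `pearson_r` and the sample values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sum(a * b for a, b in zip(dx, dy)) / denom


# Hypothetical per-event values for two characterizations.
group_a = [0.10, 0.20, 0.30, 0.40]
group_b = [0.05, 0.10, 0.15, 0.20]
r = pearson_r(group_a, group_b)  # close to 1: the two groups move together
```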

[0132] Over time, IDM values for demographic characterizations orcombinations of demographic characterizations can become predictors fordemographic characterizations, within determinable statisticalconfidence intervals. Such demographic characterizations can be combinedto create a similarity index. The present invention can use a similarityindex to determine probability ranges for various levels of demographiccharacterizations that can be used by a possibilities reduction systemof the present invention.

[0133] The present invention may also apply another assumption to determine a combination of demographic aspects that can have a bearing on television viewing. Such relationships can be determined by applying rules to demographic aspects over time. Such rules may include, but are not limited to, additive, subtractive, and dominance or recessiveness rules. If an IDM2 value for a level-two demographic characterization is greater than the IDM1 values for both demographic aspects comprising the level-two demographic characterization when viewed alone, that value is said to correspond to an additive rule. If an IDM2 value for a level-two demographic characterization is less than the IDM1 values for both level-one demographic characterizations when viewed alone, that value is said to correspond to a subtractive rule. If an IDM2 value for a level-two demographic characterization falls between IDM1 values for the respective demographic aspects of the level-two characterization, that behavior is deemed to be dominant/recessive, and the IDM1 value closest to the IDM2 value is deemed the dominant IDM1 value.
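
By way of illustration only, the additive, subtractive, and dominant/recessive rules described above might be applied as in the following sketch. The function name `classify_rule` and its return format are hypothetical. The specification does not address the case in which the IDM2 value equals one of the IDM1 values, nor the case of an exact tie in distance; the sketch assumes the former is treated as falling between the two values and the latter is resolved in favor of the first aspect.

```python
def classify_rule(idm2, idm1_a, idm1_b):
    """Return (rule, dominant), where dominant is "a", "b", or None."""
    low, high = sorted((idm1_a, idm1_b))
    if idm2 > high:
        return ("additive", None)      # greater than both level-one values
    if idm2 < low:
        return ("subtractive", None)   # less than both level-one values
    # Between the two values (boundaries included by assumption):
    # the closer level-one value is deemed dominant; ties go to "a".
    dominant = "a" if abs(idm2 - idm1_a) <= abs(idm2 - idm1_b) else "b"
    return ("dominant/recessive", dominant)
```

For instance, with hypothetical IDM1 values of 0.10 and 0.20, an IDM2 value of 0.30 is classified as additive, 0.05 as subtractive, and 0.18 as dominant/recessive with the second aspect dominant.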

[0134] A multi-dimensional array can be kept which records rules appropriate to each IDM relationship for each event over time. Such an array may also be extended to include rules comparing multi-aspect combinations. Statistical tests can be applied to determine confidences with which an aspect or combination of aspects is related to another aspect or combination of aspects by any of the given rules. Weights may be assigned to each rule, with a preferred embodiment using linearly higher weights to represent exponential rule growth.

[0135] Values produced by such a system may be stored in an array, or recombination matrix. Rules can be applied additively for a demographic characterization of a particular level from lower level IDM solutions at which such rules are identified. Final confidences can be determined through Pearson-r correlation with IDM calculations. Demographic characterization recombination matrices can aid in the calculation of probability ranges.

[0136] A mean of each demographic specification's event matching representation may be stored over time in aspect representation indices. Such one-dimensional arrays can also be used in confidence determinations for information obtained through other portions of the present invention. These indices can be updated as IDM calculations continue through STB event time. Aspect representations can give the present invention approximate sample sizes for behaviors of each demographic characterization.

[0137] Aspect representation indices can also be used when determining individual behaviors. To determine such behaviors, probabilities can be assigned to each demographic specification and each STB event. If an STB matches an event, corresponding probabilities can be ascribed to an STB, where such probabilities are normalized so the highest probability is unity. If an STB does not match an event, probabilities ascribed to such an STB may be the linear inverse of probabilities ascribed to an STB matching an event. For each STB and demographic specification at a given time, summing ascribed probabilities for each demographic specification and dividing by the number of probabilities can compute the probability that an STB corresponds to a given demographic specification. A one-way analysis of variance can then be performed on such data to determine the likelihood of such data representing a user of a respective STB.

[0138] As previously discussed, confidences with which a demographic specification can be linked to an STB can be useful in refining data generated by the present invention. This usefulness arises out of the fact that combining relative confidences with a level-n demographic characterization yields a level-n+1 demographic characterization. Such relative confidences are also useful when evaluating assumptions, or rules, generated by the present invention. Such assumptions may be generated in the above-described aspect recombination rules, specification similarities, aspect representation, and individual behavior determination processes.

[0139] Information developed by each assumption has an empirical validity, which can be converted to a statistical confidence. This empirical validity can be determined over STB event time by the assignment of expectation values, tracking of empirical values, and through time-correlation tests between expectation and empirical values. Each assumption's validity may require determination through a slightly different statistical formula.

[0140] Assumption validities, along with sample numbers and sample values, can provide probabilities and ranges for information concerning demographically specific groups. The present invention may translate information into a chosen confidence interval, then, for each specification and each process performed, sets of individuals matching and not matching all possible events can be produced.

[0141] Through this process, each demographic specification group can be labeled with statistical “guesses” at a final range of values for an event rating. A system of linear equations can then be solved to further reduce the ranges and fill in residual gaps left by such processes. Demographic numbers for each demographic characterization can also be used to further reduce the ranges by this possibilities reduction system. By way of example, without intending to limit the present invention, imagine a set of viewers matching a set-top box event contains 300 people who are Asian, and this same set contains 200 people who earn over $80,000 per year. If it is known that all people in the set who earn over $80,000 per year are Asian, then the 100 remaining Asian people can be placed into categories corresponding to incomes below $80,000 per year. Some of these categories may already be reduced, so information can be filled in quickly.
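By way of illustration only, the counting argument in the example above can be sketched as follows. The two function names and the overall set size of 400 are hypothetical and are introduced solely for this sketch; the example above does not state the size of the set. The first function gives the range of possible overlap between two groups before any further fact is known, and the second gives the remainder once one group is known to lie entirely within the other.

```python
from typing import Tuple


def overlap_bounds(count_a: int, count_b: int, total: int) -> Tuple[int, int]:
    """Lowest and highest possible number of members belonging to both groups."""
    return (max(0, count_a + count_b - total), min(count_a, count_b))


def remainder_when_subset(superset_count: int, subset_count: int) -> int:
    """Members of the larger group left once a fully contained group is removed."""
    if subset_count > superset_count:
        raise ValueError("a contained group cannot exceed its containing group")
    return superset_count - subset_count


asian = 300      # from the example above
over_80k = 200   # from the example above
set_size = 400   # hypothetical; not stated in the example

# Range of viewers who could be both Asian and earning over $80,000.
print(overlap_bounds(asian, over_80k, set_size))
# Asian viewers earning below $80,000 once all high earners are known to be Asian.
print(remainder_when_subset(asian, over_80k))
```

Knowing that every high earner is Asian collapses the overlap range to a single value, which is the kind of reduction the possibilities reduction system is described as exploiting.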

[0142] Essentially, while an IDM and other processes of the present invention may have increasingly low confidences as demographic specificity levels increase due to low sample representation for many of the demographic characterizations at these higher levels, a possibilities reduction process can take advantage of these low numbers by filling in demographic characterizations with certain matches or non-matches of STB events. A possibilities reduction system can accomplish this because demographic data is known at these levels, and thus reduces the set of possibilities remaining.

[0143] Ranges can be reduced even in cases where demographic characterization numbers are not filled completely. The possibilities reduction system may use linear algebra rules to describe the remaining possibilities in terms of mathematical symbols after each piece of information is considered. A matrix of specific probability ranges and their mathematical relationships with other specific probability ranges for a given event can be iteratively updated in a pre-determined order until such iterations do not significantly change any matrix values.

[0144] Ultimately, ranges that cannot be further reduced may not be modified by this iterative process without first modifying the confidence level. For example, by decreasing the confidence level (for example, from 90% to 88%), ranges may be reduced and some demographic characterizations may be filled in. An iterative process can then go through all characterizations and again iteratively reduce value ranges as much as possible. At some point, the confidence level in the matrix may be increased to its original level. However, additional mathematical procedures may be introduced to avoid an error function of this matrix getting trapped in a (spurious) local minimum.

[0145] It is important to note that even if a customer is only satisfied with confidences associated with level-five data, the relevant level-five characterizations can be composed of any five demographic categories about which the present invention collects demographic data and in which a customer may be interested.

[0146] In addition to a probability reduction process, a preferred embodiment of the present invention may further reduce the likelihood of errors by employing a Monte Carlo-type fit to such data after a probability reduction process has been applied. Such a fit may account for bias issues associated with data collected by the present invention. The present invention may further address bias issues by monitoring set-top box usage and identifying inconsistent behavior. For example, the present invention may detect when households leave a set-top box on while playing a DVD, videotape, or even while on the telephone or when out of the home for extended periods by analyzing usage patterns for each set-top box. By way of example, without intending to limit the present invention, the present invention may define “special events” which correspond to situations in which a particular set-top box remains on but no state changes occur over a period that is significantly longer than an average state-change interval for that set-top box.
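By way of illustration only, detection of such “special events” can be sketched as follows. The function name `find_special_events`, the representation of state changes as a sorted list of timestamps in seconds, and the threshold multiplier `factor` (here 3.0) are assumptions made for this sketch; the description above says only that the period is “significantly longer” than the average state-change interval.

```python
from statistics import mean
from typing import List, Tuple


def find_special_events(change_times: List[float], factor: float = 3.0) -> List[Tuple[float, float]]:
    """Return (start, end) gaps between consecutive state changes that exceed
    `factor` times the average state-change interval for this set-top box.

    `change_times` must be sorted in ascending order. `factor` is a
    hypothetical tuning value, not a value taken from the description.
    """
    if len(change_times) < 3:
        return []  # too few intervals to form a meaningful average
    pairs = list(zip(change_times, change_times[1:]))
    # Simplification: the average includes any long gaps themselves.
    average_interval = mean(end - start for start, end in pairs)
    return [(start, end) for start, end in pairs if (end - start) > factor * average_interval]
```

For a hypothetical box whose state changes occur at 0, 10, 20, 30, 40, 50, and 650 seconds, only the final gap from 50 to 650 seconds would be flagged under these assumptions.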

[0147] The present invention may also be integrated with recording systems, such as the Replay-TV and Tivo systems, to allow more detailed analysis of consumer behavior. Recording systems may be of interest in the present invention because of the level of control a user has over a given program, including the ability to pause live programming.

[0148] An alternative calculation method which may be used by IDGM Calculation Algorithm 940 essentially involves a more additive, yet thorough approach based on the demographic assumption to determine DATA2 values for DATA1 data. However, rather than defining an error function to be minimized, this method represents a lengthy process of linear algebra by essentially comparing each zip code to each other zip code for each demographic characterization in a tailored customer query. This calculation method is an alternative mathematical method that solves the same problem as the previous method.

[0149] The present invention may essentially translate customer queries into a query of the type “What percentage of a certain demographic specification performed some action during a program state?” and calculate the result of such queries. Definable Viewing Units (“DVU's”) include whole or partial content that can be matched to time and location. Customer queries can take the form of a combination of actions and DVU's.

[0150] Customer queries can be translated to a final query through determination of a plurality of values within subsystems of the present invention and then operating on those values to derive a final query value. Subsystem query values can be determined by a process computer, which employs multiple sub-processes to determine each sub-value. A processor can convert any percentages generated by such sub-processes to numbers, prorate and extrapolate such numbers to include non-sampled individuals, and account for those STB's presenting content to multiple viewers.

[0151] A process computer may receive queries specific to a demographic category. A process computer can break apart a query into two components: a set of zip codes for which the query will return data, and all other query components. For each zip code in a query, which may be referred to as a “query zip,” the query zip can be matched against other zip codes for which the present invention collects market data. This process will result in n−1 zip code combinations (“zip-zip combinations”), where a query zip is the first zip code in each combination, and where n is the total number of zip codes about which data is collected by the present invention. These zip-zip combinations can be run through a multi-step process that determines final query values.
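By way of illustration only, the pairing of a query zip with every other sampled zip code can be sketched as follows. The function name `zip_zip_combinations` and the representation of a combination as a (query zip, test zip) tuple are assumptions made for this sketch.

```python
from typing import List, Tuple


def zip_zip_combinations(query_zip: str, sampled_zips: List[str]) -> List[Tuple[str, str]]:
    """Pair the query zip with every other zip code for which data is collected.

    With n sampled zip codes (including the query zip), this yields n - 1 pairs,
    each with the query zip as its first element.
    """
    return [(query_zip, test_zip) for test_zip in sampled_zips if test_zip != query_zip]


# Zip codes other than 34208 are hypothetical placeholders.
sampled = ["34208", "10001", "60614", "94110"]
pairs = zip_zip_combinations("34208", sampled)
print(len(pairs))  # n - 1 combinations for n sampled zip codes
```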

[0152] The first of these processes, CP1, determines weights for each zip-zip combination. CP2 determines a value each zip-zip combination can infer for the final query, or for some sub-query. CP3 can determine if there is more than one pattern of inferred query values and, if there is, determines a set of values from the current dataset. CP4 reduces values from CP3 by one level. CP3 and CP4 can be iteratively acted upon until only one pattern is apparent, and a result set can be determined.

[0153] CP1 determines weights for each zip-zip combination based on a variety of factors. One such factor is the percentage of a test zip code's population that falls into the query category in question. In a preferred embodiment, a test zip code's population should be as different as possible from a query zip. This provides a resolution increase, as the larger the differences between the categories, the better the effect of individual common points can be determined.

[0154] For each category other than zip codes, CP1 may give higher weight to those categories for which the population of the test zip is similar to the query zip. The importance of these other categories can be determined to make an overall determination of how similar the test zip is to the query zip in terms of such other categories. By way of example, without intending to limit the present invention, if test zip(1) is ninety percent similar by age distribution and ten percent similar by religious affiliation, test zip(1) will likely be assigned a higher weight than one that is ten percent similar by age distribution and ninety percent similar by religious affiliation. The relevance of each category, and hence its weight, can be determined by the effect a combination of zip codes seems to have on market share for a given query.

[0155] The present invention can determine a weight for a given test zip by the formula Weight=(c)(W), where c is the percent difference between persons belonging to a query category in a test zip and those belonging to the same query category in a query zip.

[0156] As an example, without intending to limit the present invention, if CP1 received a query including query category ‘African American’, and the following were true for sample query zips and test zips:

[0157] query zip: 10% AA+90% other=Market Share=20%

[0158] test zip: 20% AA+80% other=Market Share=40%,

[0159] then c may be calculated as a percent difference between 10.0 and 20.0.

[0160] W can be determined by summing weights assigned to each “other” category, where other categories are defined as categories in the union set of categories between a query zip and test zip, except for a query category or any sub-categories defining a query category. In equation form, this can be written as W=sum(w[x])=sum(w[1] . . . w[x]), where x is a numerical identifier of each category, and x is incremented over the number of categories.

[0161] For each category:

[0162] Let %d(x,y)=the absolute value of a percent difference between ‘x’ and ‘y’

[0163] Let a=%d(A[0], A[n]), where A[0] represents a demographic percentage in a query zip for a given demographic category, and A[n] represents a demographic percentage in a test zip for a given demographic category.

[0164] Let b=%d(B[0], B[n]), where B[0] represents a market share for a given demographic category in a query zip, and B[n] represents a market share for a given demographic category in a test zip.

[0165] Let each category weight w(x)=q(a)×q(b)

[0166] where q(a)=the weight of ‘a’ and q(b)=the weight of ‘b’ where:

[0167] q(a)=f(a)=approximately 1/a

[0168] q(b)=f(a,b)=a function in which:

[0169] as ‘a’ goes to 0 and %d(a,b) goes to 0, q(b) goes to infinity, and

[0170] as ‘a’ goes to infinity and %d(a,b) goes to 0, q(b) goes to 0.

[0171] For example, if a query category is African American, categories in a complete union set of categories between the query zip and test zip which do not include African Americans should be reviewed. One such category may be White Male, and by way of an example, the following may be true:

[0172] query zip: 24% WM+76% other=Market Share=20%

[0173] test zip: 30% WM+70% other=Market Share=40%

[0174] The statistics above can be rewritten as follows if the “other” categories and their respective percentages are disregarded:

[0175] query zip: A(0) WM=B(0), where A(0)=24%, B(0)=20%

[0176] test zip: A(n) WM=B(n), where A(n)=30%, B(n)=40%

[0177] An optimum weight function for a test zip code is an intrinsic property of its relationship to a query zip. q(a) can be seen as a measure of a percent similarity of a given category between query and test zips, and q(b) can be seen as a measure of the effect a category has in describing market share differences between query zip and test zip. Thus, categories receiving heavier weights are those in which %d(A[0], A[n]) is low or null and at these low values %d(B[0], B[n]) is a similar trended value. In a preferred embodiment, as %d(A[0], A[n]) increases, %d(B[0], B[n]) should decrease.

[0178] A natural function fitting q(a) and q(b) requirements will be one that provides a proper optimization of zip codes, such as a Laplacian function. It is clearly a symmetrical surface in 3D space with four specific boundary conditions. If a specific function cannot be found that provides such a fit, a practically optimal function can be created through power series.

[0179] CP2 can evaluate the percentage of persons falling into a query category in a query zip and the percentage of persons falling into the query category in all test zips, and use this to evaluate specific market share differences between the query zip and each test zip code. CP2 then evaluates the information about these two zip codes, and establishes a best guess as to the query category differences that contributed to an observed market share difference.

[0180] CP2 can take demographic and market share information corresponding to individual zip codes and solve a set of linear equations for variables involved. One variable resulting from such a solution can be related to a percentage of persons falling into a query category for a given zip code or set of zip codes. Another such variable may be the percentage of persons in a zip code not falling into a category. CP2 may set the right-hand side of each equation to the market share of a query DVU.

[0181] By way of example, without intending to limit the present invention, a given zip code or set of zip codes may yield a formula such as: 0.10AA+0.90o=0.20. Such a formula may indicate that a zip code or set of zip codes has a demographic makeup that is ten percent African American and ninety percent “other.” To further ease understanding of this example, assume the DVU is the percentage of TV sets that watched all of the show ER from 10.00.00 to 11.00.00 on Nov. 11, 1999. In the above example, the market share for this DVU is 20 percent. It should be clear to one skilled in the art that in the equation above, the variable ‘AA’ refers to the percentage of African American persons in a given zip code fulfilling DVU criteria.

[0182] A preferred embodiment of the present invention may represent such an equation using the following three matrices for computational clarity and storage efficiency:

[AA o] [10 90] [20]

[0183] CP2 may translate information for all query and test zip codes into this form for a query category and for other categories in which none of the persons making up those category percentages fall into a query category. If the zip code, category, and DVU example above were expanded to include a test zip(n), there would be 3 matrices again, but this time 2-dimensional:

[AA o] [10 90] [20]
[AA o] [12 88] [19]

[0184] Relationships between such matrices and their resulting linear equations can be seen when the linear equations are represented graphically by a line in 2-dimensional space. If two lines are on the same plane, such lines must either 1) be the same line, 2) be parallel lines, or 3) cross at some point. Through these relationships, CP2 can create a “best guess” at a mutual answer for variables in the equations based on information the equations imply. This guess need not fall onto a point in the interval of either equation.
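By way of illustration only, the intersection of two such lines can be sketched as follows, using the two rows of the matrices in paragraph [0183] expressed as fractions. The function name `solve_two_lines` and the use of Cramer's rule are choices made for this sketch and are not stated to be the method of the preferred embodiment; a value of None stands for the cases in which the lines are identical or parallel.

```python
from typing import Optional, Tuple


def solve_two_lines(
    a1: float, b1: float, c1: float,
    a2: float, b2: float, c2: float,
    tol: float = 1e-12,
) -> Optional[Tuple[float, float]]:
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule.

    Returns None when the determinant is (near) zero, that is, when the two
    lines are the same line or are parallel and no single crossing exists.
    """
    det = a1 * b2 - b1 * a2
    if abs(det) < tol:
        return None
    x = (c1 * b2 - b1 * c2) / det
    y = (a1 * c2 - c1 * a2) / det
    return (x, y)


# Query zip:  0.10*AA + 0.90*o = 0.20
# Test zip n: 0.12*AA + 0.88*o = 0.19
solution = solve_two_lines(0.10, 0.90, 0.20, 0.12, 0.88, 0.19)
print(solution)  # approximately (-0.25, 0.25)
```

Under these figures the crossing point lies at roughly AA = -0.25 and o = 0.25, which is outside the range of a valid percentage and so illustrates the remark that the guess need not fall onto a point in the interval of either equation.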

[0185] This process can be seen as analogous to drawing a set of lines in 2-dimensional space and then finding intersections of these lines. This intersection may occur once per two dimensions, indicating two particular variables. The first of these variables relates to a particular category of interest, and the second variable is always ‘other’.

[0186] Statistically, with several test zips with which to work and each with a given number of samples, CP2 may be seen as analogous to sampling a much larger number of samples than are in any individual sample zip, and drawing a normal curve for each. The resulting curves may then have each curve subtracted from them, and one normal curve may be built from the differences. In some cases, this curve may be a line representing a single value, rather than a curve.

[0187] At this point, the normality of an extrapolated curve is no longer an assumption, as only one specific category variable is of concern at a given time. Any “other” values not under the peak of a curve represent a mathematical confidence level based only on sample size versus population error, and not a statistical skewing possibility.

[0188] CP3 can take information from CP1 and CP2 in the form of a series of test zip codes and the best guess value for each. Based on this data, CP3 can then determine whether such best guesses exhibit a one-dimensional pattern. If such a pattern exists, CP3 can return the peak of the pattern as the query return value. If such a pattern is not exhibited, CP3 may pass data to CP4.

[0189] CP4 takes CP3 data and attempts to establish a one-dimensional pattern from it. CP4 can run multiple queries through CP1, CP2, and CP3 using each zip code in the present invention's sample population as a query zip. For each iteration through CP3, a phantographic category percentage can be assigned to each zip code. These can be reinserted to CP1, CP2, and CP3 using a query zip and its assigned phantographic categories in place of demographic category percentages.

[0190] With each iteration, the complexity of CP4 data is reduced by one level. When CP4 data exhibits a one-dimensional pattern, this result can be returned as a final query value. If CP4 data exhibits more than a one-dimensional pattern, the data can be fed back through CP1, CP2, and CP3 for additional analysis, using new phantographic category percentages with each iteration. Through this process, CP3 and CP4 can identify those demographic categories having a high effect on viewership.

[0191] As previously described, CP3 can receive data from CP1. Such data may be in the following form:

test zip 1    zip 1 weight
test zip 2    zip 2 weight
test zip 3    zip 3 weight
. . .
test zip n    zip n weight
total = 1.00

[0192] In addition to data from CP1, CP3 may also receive data from CP2. Such CP2 data may resemble the following:

test zip 1    query zip-zip 1 linear equation solution value
test zip 2    query zip-zip 2 linear equation solution value
test zip 3    query zip-zip 3 linear equation solution value
. . .
test zip n    query zip-zip n linear equation solution value

[0193] CP3 can multiply solution values for each zip code by the weight of that zip code to get a sub-answer. These sub-answers can then be summed to yield a final answer.
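By way of illustration only, the weighted sum described above can be sketched as follows. The function name `final_answer` and the sample weights and values are hypothetical. The sketch also prorates the weights so that they total 1.00, which is assumed here to correspond to the “total = 1.00” row in the CP1 data form.

```python
from typing import Sequence


def final_answer(weights: Sequence[float], values: Sequence[float]) -> float:
    """Sum of (prorated weight * solution value) over all test zips."""
    if len(weights) != len(values):
        raise ValueError("each test zip needs exactly one weight and one value")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Prorate so the weights total 1.00 before forming the sub-answers.
    sub_answers = [(w / total_weight) * v for w, v in zip(weights, values)]
    return sum(sub_answers)


# Hypothetical weights and CP2 solution values for three test zips.
print(final_answer([0.5, 0.3, 0.2], [0.2, 0.4, 0.1]))
```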

[0194] While CP3 can generate query results based on linear equation solutions, CP3 can also generate and analyze graphs of zip code weights (w) versus linear equation solution values (v). An example of such a graph is illustrated by FIG. 12.

[0195] CP3 may reduce the influence of random noise on such a graph by assigning each w the average of its value and the values on either side of it. This procedure may be repeated until all noise is reduced to acceptable levels. The result would be a graph with one of the following conditions:

[0196] 1) one defined peak from one direction;

[0197] 2) one defined peak from two directions;

[0198] 3) more than one defined peak, possibly with varying heights; or

[0199] 4) a random curve.

[0200] If a curve exhibits conditions outlined in numbers 1 or 2 above, a near-perfect guess at an ultimate query return value can be made. Such a value may be based on a peak value observed, or may result from extrapolation of a graph over a larger data range.

[0201] If a curve exists exhibiting conditions outlined in number 3 above, which is illustrated by FIG. 12, then at least one additional category may be affecting query data. A random category can be created, called a phantographic, and relevant percentages can be assigned to that category to account for peak distributions observed in the graph. CP4 can then feed this phantographic category back into CP1 and CP2 for the query in question to determine an appropriate query return value.

[0202] By way of example, if the data in FIG. 12 represented a 60/40 distribution, the 60 percent distribution may be assigned to zip-zip combinations where test zips are directly under parts of the graph associated with the 60 percent peak. The 40 percent distribution may be assigned to zip-zip combinations where test zips are directly under portions of the graph associated with the 40 percent peak. Test zips falling under valleys between such peaks may be ignored. Through this system, the present invention should account for categories affecting market share differences without requiring tracking of a large number of categories.
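By way of illustration only, the neighbor-averaging step of paragraph [0195] and a simple count of defined peaks can be sketched as follows. The function names, the handling of the first and last points (averaged with their single neighbor), and the definition of a peak as a point strictly higher than both neighbors are assumptions made for this sketch. The sample weights are hypothetical.

```python
from typing import List


def smooth_once(w: List[float]) -> List[float]:
    """Replace each weight with the average of itself and its adjacent weights."""
    smoothed = []
    for i in range(len(w)):
        window = w[max(0, i - 1): i + 2]  # endpoints have only one neighbor
        smoothed.append(sum(window) / len(window))
    return smoothed


def smooth(w: List[float], passes: int) -> List[float]:
    """Repeat the averaging a fixed number of times."""
    for _ in range(passes):
        w = smooth_once(w)
    return w


def count_peaks(w: List[float]) -> int:
    """Count interior points strictly higher than both of their neighbors."""
    return sum(1 for i in range(1, len(w) - 1) if w[i - 1] < w[i] > w[i + 1])


# Hypothetical weights ordered by increasing solution value v.
noisy = [1, 3, 2, 6, 5, 7, 3, 1, 2, 1]
print(count_peaks(noisy), count_peaks(smooth_once(noisy)))
```

In this hypothetical series, a single averaging pass reduces four apparent peaks to one. A graph that still shows more than one peak after smoothing would, under the description above, suggest that a phantographic category should be assigned.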

[0203] As CP3 feeds data to CP4, CP4 can assign phantographic category percentages to each zip code, then route the result through CP1, CP2, and CP3, thereby reducing underlying patterns by one level. The following is an example of calculations and procedures employed by CP4.

[0204] FIG. 13 is a sample graph generated by CP3 for a query category, zip code, and market share. CP2 values are arranged in increasing order from left to right, with higher lines indicating heavier weighting for a given value. When CP2 generates two or more of the same value, the weight represented on such a graph may be increased to the sum of the individual weights for all matching values.

[0205] FIG. 14 provides a more detailed view of the graph in FIG. 13. The sum of all weight values in the graph should total 1, and all weights must fall between 0 and 1. This is due to the prorating step performed after CP1. FIG. 15 provides an additional view of FIG. 14, with values provided for various points on the graph, and zip codes numbered 1 through 25. The following is a table of values illustrated by FIG. 15:

Zip Code    weight (to 3 significant digits)    value (to 2 significant digits)
1           0.402                               0.50
2           0.515                               0.79
3           0.027                               0.23
4           0.040                               0.71
5           0.038                               0.90
6           0.070                               0.29
7           0.054                               0.41
. . .       . . .                               . . .
25          0.052                               0.26

[0206] In a more abstract sense, data passed from CP3 to CP4, when graphed, may resemble FIG. 16, FIG. 17, or FIG. 18. FIG. 16 illustrates CP3 data resulting in a graph with a single peak. FIG. 17 illustrates CP3 data resulting in two peaks. FIG. 18 illustrates CP3 data with many peaks. While CP3 data may resemble one of these three figures, CP4 does not distinguish between such cases.

[0207] If, for a given query, CP3 resulted in a graph similar to FIG. 16, this may result in a great deal of confidence in a query result value. Such a value may be either the sum of the means of each value times its respective weight or simply the value underneath the peak of the graph. Assuming a symmetrical and “natural” weighting function or algorithm, CP3 data yielding FIG. 16 would provide validity to the assumption that persons with more similar graphic makeup have more similar television viewing patterns. The present invention can answer the questions customers truly intend to ask, as any quality shared by that group alone is represented in the CP1, 2, 3 algorithm. This alone can represent a significant improvement over statistical methods employed in the prior art.

[0208] Rather than the single-peak graph of FIG. 16, if CP3 were to result in a graph similar to FIG. 17, the correct answer is not one that would be given by a full average of all data, as this would provide a value between the two peaks. The correct answer is the value on the V axis directly under the first peak, or the value on the V axis directly under the second peak. CP4 can select an appropriate value from two or more such values.

[0209] The fact that there are two peaks in FIG. 17 indicates that another demographic category should have been part of the graphic data. When two peaks exist such as in FIG. 17, there is only one difference between the set of values forming the peak on the left versus the set of values forming the peak on the right. While CP4 may not determine what this difference relates to, CP3 data is based on zip codes, and the difference is likely to be geographic. For example, it may be that the peak on the right is formed by zip codes in coastal cities, and the peak on the left is formed by non-coastal cities. Regardless of the source of the difference, FIG. 17 clearly illustrates its existence.

[0210] If all graphic categories are ignored except the query category, a new set of categories, called phantographic categories, can be assigned to test zips making up the chart distribution. A value of 0 can be assigned to test zips creating the left-hand peak, and a value of 1 can be assigned to those test zips creating the right-hand peak. If this data set is now run through CP1, CP2, and CP3, a similar but more dispersed graph should result. If all zip codes sampled by the present invention were iterated through as query zips, and all others were test zips, and the value belonging to each peak was assigned to those test zips beneath it as the phantographic category percentage for each query zip in the system, this data could be run through CP1, CP2, and CP3.

[0211] This would result in a new graph for the query zip. Such a graph should result in all phantographic categories having similar percentage distributions for a given zip code. Further, a percentage exhibited by a query zip can be determined, and thus the initial peak associated with the query zip can be properly selected by CP4.

[0212] The present invention may further employ a “category generator”, which can search demographic, geographic, and other databases for a graphic distribution percentage matching any recurring phantographic distributions. While many are likely to remain undiscovered, if one is found it may be added to a list of categories monitored by the present invention.

[0213] Due to CP4, patterns in CP3 data cannot escape the present invention. Even algorithms at the heart of data generated by a random number generator could be determined through CP4's iterative processes.

[0214] FIG. 5 is a block diagram of modules used in Graphic Data acquisition. As FIG. 5 illustrates, the present invention may acquire data for Interval-updating Graphic Database 620, and ultimately for IDGM Graphic Matrix 910, from outside sources such as Graphic Vendor 610. However, as also illustrated in FIG. 5, the present invention may also garner data for Interval-updating Graphic Database 620 from Evolving STB Viewer Possibilities Database 960.

[0215] Evolving STB Viewer Possibilities Database 960 may include, for each set-top box, behaviors and demographics associated with one or more regular users of said set-top box. Evolving STB Viewer Possibilities Database 960 may also identify a set of set-top boxes whose monitored behavior best fits demographics for a given geographic region. In a preferred embodiment, regional demographics may be compiled from specification percentages of such set-top boxes. In an alternative embodiment, specification percentages may be fit by the weight of their magnitudes. The present invention may correlate and assign demographics to set-top boxes, and the sum of these can be combined to indicate demographics for a specific region.

[0216] While Interval-updating Graphic Database 620 may initially receive data from Graphic Vendor 610, Interval-updating Graphic Database 620 may not always require such data. As the quality of data stored in Evolving STB Viewer Possibilities Database 960 increases, the present invention may no longer require data from Graphic Vendor 610.

[0217] Block 460 identifies a process that produces data which may raise privacy concerns. As illustrated in FIG. 5, the present invention can protect such data by restricting access to such data to only components of the present invention. Information stored inside Block 460 can only be accessed by customers through components of the present invention, such as Block 940.

[0218] FIG. 6 is a block diagram illustrating modules used by an Individual Behavior Determination System of the present invention to acquire individual behavior data. As illustrated by FIG. 6, IDGM Calculation Algorithm 940 may extract event data from Tuner Data Center 930. IDGM Calculation Algorithm 940 may also perform event analysis.

[0219] As with other parts of the present invention, an individual behavior system may use the demographic assumption at its core. An individual behavior system may also assume that statistical determinations can be made that define a viewer of an STB by measuring STB states over time and ascribing probabilities to an STB. Individual viewers of an STB can then be statistically identified based on behaviors exhibited on an STB.

[0220] The mean of all ascribed specification percentages may be kept in Evolving STB Specification Percentage Database 950, illustrated in FIG. 6. Evolving STB Specification Percentage Database 950 may comprise, for each specification, a percentage of time that a behavior associated with a set-top box matches a particular specification. Viewed over time, Evolving STB Specification Percentage Database 950 may contain a “fuzzy” percentage, or Identity Percentage, for each specification, which may be determined for each set-top box.

[0221] The “Attribution Percentage” ascribed to an STB for a given query behavior ‘n’ can be calculated through a piecewise function:

[0222] $AP_{(n)} = \begin{cases} \text{behavior match \% of that specification}, & \text{if YES} \\ \dfrac{1/(\text{Process 1 \% of spec})}{\sum\limits_{1}^{s(c)} 1/(\text{Process 1 \% of spec})}, & \text{if NO} \end{cases}$

[0223] Where s(c) is the total number of mutually exclusive specifications of the query category/categories.

[0224] The “Identity Percentage” of any demographic specification at any time (t) is simply: $IP_{(n)} = \sum\limits_{1}^{n(t)} AP_{(n)}/n$

[0225] Where n(t) is the total number of queries to date involving the zip code in which the sample exists.
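By way of illustration only, the two formulas above might be sketched in Python as follows. The function names and the dictionary layout are hypothetical and are not part of the disclosed system. The sketch assumes that “Process 1 %” denotes, for each specification, the percentage of that specification's population exhibiting the behavior (the first percentage listed for each specification in a query result), and that every such percentage is nonzero.

```python
from typing import Dict, List

Percentages = Dict[str, float]  # specification label -> percentage (0-100)


def attribution_percentage(match_pct: Percentages,
                           process1_pct: Percentages,
                           exhibited: bool) -> Percentages:
    """Return AP(n) for every specification of one query behavior.

    match_pct    -- "% of behavior matches" per specification (YES branch)
    process1_pct -- assumed meaning of "Process 1 %" per specification
    exhibited    -- True if the STB exhibited the query behavior
    """
    if exhibited:
        # YES branch: use the behavior match percentage directly.
        return dict(match_pct)
    # NO branch: normalized reciprocals, so specifications that rarely
    # exhibit the behavior receive the larger share.
    # Assumption: no percentage is zero (otherwise this divides by zero).
    inverse = {spec: 1.0 / pct for spec, pct in process1_pct.items()}
    total = sum(inverse.values())
    return {spec: 100.0 * inv / total for spec, inv in inverse.items()}


def identity_percentage(ap_history: List[Percentages]) -> Percentages:
    """Return IP(n): the mean of all AP(n) values recorded to date."""
    n = len(ap_history)
    return {spec: sum(ap[spec] for ap in ap_history) / n
            for spec in ap_history[0]}
```

Under these assumptions, first-column values of 27%, 3%, 12%, and 8% for a behavior that was not exhibited yield attribution percentages of approximately 6%, 58%, 14%, and 22%.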

[0226] The following is included by way of example, without intending to limit the present invention. Below are five sample queries submitted to the present invention over the first two weeks of operation, and sample results generated by the present invention. These queries involve the category “race” for the zip code 34208.

Query #: 1
Behavior: watched >80% of Friends, 8:00 pm-8:30 pm on 3/23/00
Black:                    08%   16% of behavior matches
White:                    23%   46% of behavior matches
Asian/Pacific Islander:   12%   24% of behavior matches
Other:                    07%   14% of behavior matches

Query #: 2
Behavior: muted channel at first commercial of Lakers vs Suns, 8:00 pm on 3/24/00
Black:                    02%   10% of behavior matches
White:                    10%   50% of behavior matches
Asian/Pacific Islander:   05%   25% of behavior matches
Other:                    03%   15% of behavior matches

Query #: 3
Behavior: watched >80% of Friends, 8:00 pm-8:30 pm on 3/30/00
Black:                    27%   54% of behavior matches
White:                    03%    6% of behavior matches
Asian/Pacific Islander:   12%   24% of behavior matches
Other:                    08%   16% of behavior matches

Query #: 4
Behavior: watched >80% of Seinfeld rerun, 7:30 pm-8:00 pm on 3/31/00
Black:                    10%   17% of behavior matches
White:                    24%   40% of behavior matches
Asian/Pacific Islander:   14%   23% of behavior matches
Other:                    12%   20% of behavior matches

Query #: 5
Behavior: watched >40% of Seinfeld rerun, 7:30 pm-8:00 pm on 3/31/00
Black:                    22%   28% of behavior matches
White:                    11%   14% of behavior matches
Asian/Pacific Islander:   23%   29% of behavior matches
Other:                    24%   30% of behavior matches

[0227] To further refine this example, assume that a particular STB exhibited the above behaviors as follows, and thus percentages attributed to that STB for the category of “RACE” were as follows:

                                 Resulting AP(n) Contribution
Behavior #      n   Exhibited    BLA      WHI      API      OTH
Behavior #1:    1   YES          16%      46%      24%      14%
Behavior #2:    2   YES          10%      50%      25%      15%
Behavior #3:    3   NO           06%      58%      14%      22%
Behavior #4:    4   YES          17%      40%      23%      20%
Behavior #5:    5   NO           23%      46%      22%       8%
ΣAP(n)                           72%      240%     108%     79%
IP(n) = ΣAP(n)/5                 14.4%    48.0%    21.6%    15.8%

[0228] Below are the results of a sixth sample query submitted to the present invention and sample Identity Percentage (IP) calculations for the “race” category:

Query #: 6
Behavior: watched >90% of Law & Order rerun, 7:00 pm-8:00 pm on 3/31/00
Black:                    02%   08% of behavior matches
White:                    14%   56% of behavior matches
Asian/Pacific Islander:   04%   16% of behavior matches
Other:                    05%   20% of behavior matches

                                 Resulting AP(n) Contribution
Behavior #      n     Exhibited  BLA      WHI      API      OTH
ΣAP(n − 1)      1-5   N/A        72%      240%     108%     79%
Behavior #6:    6     YES         8%       56%      16%     20%
ΣAP(n)                           80%      296%     124%     99%
IP(n) = ΣAP(n)/6                 13.3%    49.3%    20.7%    16.5%
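By way of illustration only, the running update shown in the table above (adding the sixth AP(n) contribution to the prior sum and dividing by the new query count) might be sketched in Python as follows. The function name and the data layout are hypothetical and are not part of the disclosed system.

```python
from typing import Dict, Tuple

Percentages = Dict[str, float]  # specification label -> percentage


def update_identity(running_sum: Percentages,
                    query_count: int,
                    new_ap: Percentages) -> Tuple[Percentages, int, Percentages]:
    """Fold one new AP(n) contribution into the running sum.

    Returns the new sum, the new query count, and the new IP(n).
    """
    new_sum = {spec: running_sum[spec] + new_ap[spec] for spec in running_sum}
    new_count = query_count + 1
    ip = {spec: new_sum[spec] / new_count for spec in new_sum}
    return new_sum, new_count, ip


# Values taken from the tables above: the sum after five queries and
# the AP(n) contribution of Behavior #6.
sums = {"BLA": 72.0, "WHI": 240.0, "API": 108.0, "OTH": 79.0}
behavior_6 = {"BLA": 8.0, "WHI": 56.0, "API": 16.0, "OTH": 20.0}
sums, count, ip = update_identity(sums, 5, behavior_6)
```

Keeping only the running sum and the query count means the full history of attribution percentages need not be retained to produce an updated Identity Percentage.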

[0229] These results would seem to suggest that the STB corresponds to “White,” although six queries may not be enough to support such a conclusion. The probability that a person actually is some IP(n) specification can be derived through straightforward statistical methods.

[0230] Referring again to FIG. 6, the number of viewers for a set-top box may be constrained by Interval-Updating Graphic Database 620. Evolving STB Specification Percentages Database 950 can receive input from IDM Calculation Algorithm 940, which may run continuously on all time-possible events.

[0231] The present invention may also utilize a second evolutionary database, illustrated in FIG. 6 as Evolving STB Viewer Possibilities Database 960. A goal of Evolving STB Viewer Possibilities Database 960 is not to determine who is using a set-top box, but rather who may be viewing particular content. Thus, Evolving STB Viewer Possibilities Database 960 may track possible or probable users of a particular set-top box, regardless of whether an individual, a couple, a family, or a large group of people is watching. Evolving STB Viewer Possibilities Database 960 may be updated at regular intervals, but optimally may be updated at every update of Evolving STB Specification Percentage Database 950.

[0232] Data from Evolving STB Specification Percentages Database 950 may be best fit to the demographics of an area, and this best fit may be held in Interval-Updating Graphic Database 620. A best fit may be calculated by spot-filling, in which the highest set-top box specification percentages fill the most demographically significant spots first. This spot-filling process continues until all spots are filled. At any time, if a category has been completely accounted for, no more spots may be taken and any relevant specification can be discarded.
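By way of illustration only, one possible reading of spot-filling might be sketched in Python as follows. The function name, the data layout, and the simplification that each set-top box receives a single specification are assumptions made for this sketch; the order in which “demographically significant” spots are filled is interpreted here simply as descending specification percentage.

```python
from typing import Dict


def spot_fill(identity_pct: Dict[str, Dict[str, float]],
              spots: Dict[str, int]) -> Dict[str, str]:
    """Greedily assign one specification to each set-top box.

    identity_pct -- STB id -> {specification -> Identity Percentage}
    spots        -- specification -> number of people with that
                    specification in the area (hypothetical input)
    """
    remaining = dict(spots)
    assignment: Dict[str, str] = {}
    # Consider the highest specification percentages first.
    candidates = sorted(
        ((pct, stb, spec)
         for stb, specs in identity_pct.items()
         for spec, pct in specs.items()),
        key=lambda candidate: candidate[0],
        reverse=True,
    )
    for pct, stb, spec in candidates:
        if stb in assignment:
            continue  # this box has already been placed
        if remaining.get(spec, 0) == 0:
            continue  # specification fully accounted for; discard
        assignment[stb] = spec
        remaining[spec] -= 1
    return assignment
```

For example, with one “WHI” spot and one “BLA” spot, a box at 49.3% “WHI” takes the “WHI” spot ahead of a box at 45.0% “WHI”, and the second box falls through to its next-highest specification.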

[0233] As a result of such calculations, Interval-Updating Graphic Database 620 ultimately holds a “best guess” at the complete specification makeup of each individual in a household. Spot-filling may not be the most accurate technique, since a true best fit reduces overall deviation. However, because higher specification percentages are weighted more heavily, spot-filling may provide an extremely close approximation.

[0234] The evolving database and best guess technique outlined above have been described in examples which determined viewing behaviors for a given home. However, the present invention can also account for persons in a bar watching a football game on Sunday, even if they are already counted at home.

[0235] To determine the individual persons matching a behavior, and not just a group percentage, current set-top box viewer possibilities may be best fit to IDM calculation specification percentages for any event for the population of some graphic region or set of regions. In this way, individual behaviors can be determined. This best fit may be performed by spot-filling in a manner similar to that outlined above. Every person can be accounted for by the present invention, whether they are at a bar, a neighbor's home, or their own home. The IDM specification percentage may be fit to evolutionary specification percentages for each box, thereby accounting for such deviations.

[0236] Ultimately, for each event, individuals matching a behavior may be known, and such data may be sent to Individual Behavior Determiner 970 and stored in Individual Behavior Database 290. Individual Behavior Database 290 may hold all individual behaviors recorded since sampling inception. Individual Behavior Database 290 may comprise a database of time-oriented arrays containing information about what each sample has done since sampling inception. Individual Behavior Database 290 may comprise a database approximating individual behaviors for each event from IDM Calculation Algorithm 270 or 940.

[0237] Individual Behavior Determiner 970 comprises a linear system that can find a best fit between IDM Calculation Algorithm 270 or 940 and Evolving STB Viewer Possibilities Database 960. Individual behaviors may be approximated as a best fit of the data groups from IDGM Calculation Algorithm 940 and Evolving STB Viewer Possibilities Database 960 for each event. In this manner, the behavior of one individual may be tracked over time.

[0238] New specification percentages from Evolving STB Percentages Database 950 may be continuously available and periodic recalculation of individual behaviors may be preferred. Percent changes in Evolving STB Percentages Database 950 may be small and determinable through time, and this may be the factor used to determine the interval of individual behavior recalculation.

[0239] FIG. 7 is a block diagram of modules comprising Future Events Query System 190 of FIG. 1. Future Events Query System 190 may be different from Past Events Query System 200. Future Events Query System 190 may comprise a web-based system, which interacts with Post Translation System 420, allowing Market Customer 530 to query the present invention regarding behaviors that are most likely to occur in the future. Future Events Query System 190 may include a web-based system with a natural language, graphical, or command-line interface, providing the customer with the ability to extract information from the system.

[0240] Individual Behavior Database 290 can contain individual-specific behavior information as determined by the present invention. Individual behaviors may be analyzed by Series Analysis System 980, which can comprise a system that looks for data trends and patterns (both directed and undirected).

[0241] Series Analysis System 980 may comprise an algorithm looking for trends in individual and group viewing behaviors that may efficiently define relevant behavior patterns. Such series analysis algorithms may be based on time, on defined behavioral events, or on undefined behavioral events, such as a straightforward series that determines behavioral patterns. These algorithms may take individual behaviors from Individual Behavior Database 290 and look for trends based on time, content, channel changes, and the like. Series Analysis System 980 may also define relevant events according to results of its analysis.

[0242] Relevant viewing events may comprise events that represent some pattern describing viewing behaviors. Events may be pre-defined events such as “changing the channel at the end of a show” or “changing the channel at the beginning of a commercial,” or they may be definitions that are more arbitrary, such as “changing the channel twice in 5 seconds before changing the channel 3 times in the next 10 seconds.” Events may result from individual behavior data mining based on series analysis and behavior pattern determination, and then reporting such patterns in a simple form. By way of example, the present invention may learn that members of a particular graphic category may not get home until 6:00 pm, then view a news channel for half an hour, then turn the set-top box off for an average of a half hour, presumably for dinner.
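By way of illustration only, a test for the more arbitrary event quoted above (two channel changes in 5 seconds followed by three channel changes in the next 10 seconds) might be sketched in Python as follows. The function name, its parameters, and the choice to anchor each candidate window at a channel-change timestamp and to require exact counts are assumptions of this sketch rather than features of the disclosed system.

```python
from bisect import bisect_left
from typing import Sequence


def matches_burst_pattern(change_times: Sequence[float],
                          first_count: int = 2,
                          first_window: float = 5.0,
                          second_count: int = 3,
                          second_window: float = 10.0) -> bool:
    """Return True if the channel-change timestamps (in seconds) contain
    `first_count` changes within `first_window` seconds, immediately
    followed by `second_count` changes within the next `second_window`
    seconds."""
    times = sorted(change_times)
    for start in times:
        middle = start + first_window
        end = middle + second_window
        # Count changes in the half-open intervals [start, middle)
        # and [middle, end) using binary search on the sorted list.
        in_first = bisect_left(times, middle) - bisect_left(times, start)
        in_second = bisect_left(times, end) - bisect_left(times, middle)
        if in_first == first_count and in_second == second_count:
            return True
    return False
```

A pre-defined event such as “changing the channel at the beginning of a commercial” could be tested in a similar way by comparing change timestamps against content presentation times.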

[0243] One series analysis that may be performed by the present invention is a Time Series Analysis (“TSA”). In a TSA, trends that can be fully described as a function of time may be identified. This distinction is made since most mathematical series analysis methods are usually referred to as ‘time-series analyses.’

[0244] An alternative series analysis that may be performed by the present invention is a Defined Event Series Analysis (“DESA”). A DESA can identify trends which may be fully described as a function of a set of behaviors, such as changing the channel near the beginning or end of an hour or near the beginning or end of certain content, watching entire programs, watching certain genres, and the like. A DESA can allow the present invention to identify features that are of interest not only to the present invention, but also to customers of the system.

[0245] Still another series analysis that may be performed by the present invention is an Undefined-Event Series Analysis (“UESA”). A UESA is similar in many respects to a DESA, except that a UESA can look for general trends while defining its own events. By way of example, without intending to limit the present invention, sampled individuals may be on time schedules or have general viewing habits about which the present invention may soon learn.

[0246] After series analysis, events that the present invention identifies as describing behaviors in Individual Behavior Database 290 may be sent to Event Definition System 400, where an emphasis may be placed on behaviors that are more current. Event Definition System 400 can also accept input from Updating Future Airings Database 990, which may hold a content guide that may include content attributes and content presentation information. Such content presentation information may include, but is not limited to, networks or channels presenting such content, and times such content was or will be made available.

[0247] Event Definition System 400 can break down programming into events defined efficiently by Series Analysis System 980. Event Definition System 400 may comprise an algorithm that may accept a broad range of content attributes. Event Definition System 400 may also break content apart into a best fit of events as determined by Series Analysis System 980.

[0248] Updating Future Airings Database 990 may comprise a database of arrays holding a best guess at what content may be aired at any time in the future. Airings Source 360 may continuously update Updating Future Airings Database 990. For times relatively far into the future, general extrapolations may be made to save data space.

[0249] Future content may be broken down in terms of viewership events existing in Event Definition System 400. From these datasets, Future Events Mapping System 450 may map future individual and group events onto future content by describing it in terms of the events defined by the Event Definition System 400.

[0250] Future Events Mapping System 450 may comprise a simple algorithm that linearly forecasts the most probable events onto a map of future programming. Future Events Mapping System 450 may comprise an algorithm that takes input from Event Definition System 400 and maps probable mutually exclusive sets of behaviors which Series Analysis System 980 forecasts for these events. Events of both systems may be identically defined, thus requiring only a best fit mapping of individual behaviors onto the future programming.

[0251] Market Customer 530 may query Future Events Query System 190, which may comprise a user-interface with options for tailoring such queries. Post-Translation System 420 can translate such queries into a mathematical formula that may be understood by Pre-Translation System 410. Such translations may simply express a query in a format that facilitates data extraction from Future Events Mapping System 450.

[0252] FIG. 8 is a block diagram of modules used in Program Entry and Program Builder Systems of the present invention. Program Entry 540 may facilitate behavioral or viewership predictions for content which has yet to be experienced by the public based on data entered by a customer. Program Entry 540 may allow a customer to enter attribute ranges for certain content, and Program Entry 540 may report specific attribute values which best fit a desired outcome. Program Entry 540 may convert such customer data to a format readable by Event Definition System 400. Event Definition System 400 may break down content to determine likely viewership or other behaviors based on statistics generated by other portions of the present invention.

[0253] Program Builder 430 can compile a content description, including various content attributes, for content that is likely to be popular. Such content descriptions may be based on events randomly pieced together from Event Definition System 400. Program Builder 430 may have a non-random component as well, in the form of an iterative system. An iterative system may reduce processing times and increase the likelihood of quality matches per unit time.

[0254] Event Definition System 400 can pull events that best describe individual viewing behaviors and patterns from other portions of the present invention, and such events may be entered into Random Generator 440. Random Generator 440 may comprise a random component for Program Builder 430, and Random Generator 440 may piece together content combinations to build a hypothetical program.

[0255] Random Generator 440 may comprise an algorithm which can select from content attributes and content components submitted to it in a dataset. Such selections may be performed in a computationally random manner, thereby allowing for a variety of dynamically generated content. In a preferred embodiment, Random Generator 440 may include an option, selected through Program Builder 430, which can mark as used those elements selected as part of a dataset, thereby restricting the recurrence of such elements.
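By way of illustration only, a random selector with an optional “mark as used” setting might be sketched in Python as follows. The class name, its parameters, and the use of the standard `random` module are assumptions of this sketch and are not part of the disclosed system.

```python
import random
from typing import List, Optional, Sequence


class AttributePicker:
    """Hypothetical stand-in for a random selector over a dataset of
    content attributes or content components."""

    def __init__(self, elements: Sequence[str],
                 mark_used: bool = False,
                 seed: Optional[int] = None) -> None:
        self._pool: List[str] = list(elements)
        self._mark_used = mark_used
        self._rng = random.Random(seed)

    def select(self, k: int) -> List[str]:
        """Pick k distinct elements at random from the current pool."""
        if k > len(self._pool):
            raise ValueError("not enough unused elements remain")
        chosen = self._rng.sample(self._pool, k)
        if self._mark_used:
            # Marking elements as used removes them from the pool,
            # which restricts their recurrence in later selections.
            for element in chosen:
                self._pool.remove(element)
        return chosen
```

With `mark_used=False`, the same element may appear again in a later selection; with `mark_used=True`, successive selections never overlap until the pool is exhausted.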

[0256] As with customer-generated content, probable content popularity for content generated by Program Builder 430 may be calculated through Program Entry 540. However, this calculation may also be run in an iterative or non-iterative cycle so that an optimal “proposed” program may be described.

[0257] FIG. 9 is a block diagram of modules used in a Data Mining and Prediction System of the present invention. Data Customer 510 may comprise a customer of the present invention interested in information from Prediction System 240 or Graphic Correlation System 920.

[0258] Graphic Correlation System 920 may comprise a graphic correlation database that may be updated by Series Analysis System 980. Series Analysis System 980 can analyze correlations in graphic data alone, without respect to tuner data. Graphic Correlation System 920 may hold correlations determined thus far, and may generate additional correlations. Data Customer 510 may use Graphic Correlation System 920 and its correlations.

[0259] Prediction System 240 may comprise both a system of algorithms which determines statistical probabilities of DATA1, DATA2, DATA3 or other behaviors that have occurred or will occur, and a web-based system allowing Data Customer 510 to query the present invention about these behaviors or trends. Such queries may be entered through a variety of means, including natural language, graphical, or command-line interfaces. The computational algorithms of Prediction System 240 may be similar to those of Series Analysis System 980, except that Prediction System 240 may be concerned with general behavior patterns, and not necessarily those having to do with television-related behaviors.

[0260] Sales Data 520, which may alternatively be seen as DATA3, may comprise sales or other operational information about business performance or trends of customers, competitors, or an industry in general. Such data may share features, such as a zip code, with a DATA2 counterpart. Market Customer 530 or other data sources may provide such data in a format readable by Series Analysis System 980. This data may comprise data based on sales figures for operations in certain geographic regions, or other information, such as colors or street locations of stores, phone numbers, building types, and the like.

[0261] Market Customer 530 may comprise a customer interested in past viewership data or a customer attempting to predict the desirability of previously unaired content. In a preferred embodiment, Market Customer 530 may provide additional data for Sales Data 520; such data may be cross-referenced to commercial programming, competitors, and the like.

[0262] While a preferred embodiment of the present invention is geared toward measuring television viewership, the present invention may be useful for other purposes. For example, the Series Analysis algorithms used by the present invention may be run against Graphic Data, Sales Data 520, and the like without respect to set-top box data. Through such analysis, the present invention may provide detailed demographic data in addition to market research services. Such analyses may be extended to look for trends in demographic data, thus further refining the understanding of a geographic region for advertisers, government agencies, and others interested in such data. This analysis could effectively be used to further define the effects of commercial programming, to more appropriately plan cities and city services, and for other such purposes.

[0263] Weighting/Specification Selector 250 may determine graphic categories of interest to such parties by evaluating values from Evolving STB Specification Percentages Database 950. Low-relevancy categories may be rejected and new categories can be selected or introduced for evaluation. Weighting/Specification Selector 250 can determine graphic categories that may be most relevant to viewing behavior, and may comprise an algorithm that weighs categories according to sets of specification percentage sums.

[0264] Higher mutually exclusive category sums, built on a smaller number of specification percentages, may create a higher weight. A formula may be provided that determines whether a category should be excluded in the next update of IDM Graphic Matrix 910. Depending on the nature of IDM Calculation Algorithm 270 or 940, weights may be included in the calculation system rather than being responsible for category exclusion in the Identity Matrix.

[0265] While the preferred embodiment and various alternative embodiments of the invention have been disclosed and described in detail herein, it will be apparent to those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope thereof.

We claim as our invention:
 1. A market data acquisition system, comprising: a means for retrieving event and embedded content data from a plurality of set-top boxes; a means for retrieving content attributes from a content attribute database; a means for correlating retrieved set-top box event data with content attributes to produce data indicating which content was experienced through the plurality of set-top boxes; a means for retrieving demographic information from a demographic information database; and a means for correlating demographic information to data indicating which content was experienced through the plurality of set-top boxes to produce, in response to a query, data indicating content experienced by a demographic group or set of demographic groups.
 2. The market data acquisition system of claim 1, in which said state-change data collection means collects data from said set-top boxes without access to set-top box specific personal or demographic information, thereby providing a layer of privacy to set-top box assignees.
 3. The market data acquisition system of claim 2, in which set-top box specific demographic or other personal data may be collected when requested or with approval given by a set-top box assignee, governmental agency, or other such authority.
 4. The market data acquisition system of claim 3, in which a list of set-top box identification numbers and zip codes or other geographic identifiers corresponding to set-top box installation points is provided to the present invention for each set-top box.
 5. The market data acquisition system of claim 1, in which said content attribute database is maintained as part of the system.
 6. The market data acquisition system of claim 1, in which said content presentation system is maintained external to the present invention.
 7. The market data acquisition system of claim 1, in which said demographic information database is maintained.
 8. The market data acquisition system of claim 1, in which said demographic information database is maintained externally.
 9. The market data acquisition system of claim 1, in which said queries are entered through a graphical, command-line, or natural language interface.
 10. The market data acquisition system of claim 9, in which said queries can result in the generation of reports for any time segment or set of time segments with high precision.
 11. The market data acquisition system of claim 9, in which said queries result in the generation of reports for individual content or for a set of content.
 12. The market data acquisition system of claim 9, in which said queries result in generation of said reports for persons fitting a demographic specification, persons fitting a demographic category, or persons fitting sets of demographic specifications and demographic categories.
 13. The market data acquisition system of claim 9, in which said queries result in reports generated for specific behaviors.
 14. The market data acquisition system of claim 9, in which said queries include one or more highly-specific times, demographic specifications, viewer behaviors, and content descriptions.
 15. The market data acquisition system of claim 9, in which said results are presented in a graphical manner, such as through a pie chart or bar graph.
 16. The market data acquisition system of claim 9, in which said results are presented as a spreadsheet or other grid.
 17. The market data acquisition system of claim 9, in which said results are presented as natural language.
 18. The market data acquisition system of claim 1, in which said content information is obtained from a source external to the present invention.
 19. The market data acquisition system of claim 1, in which said content information is embedded in content as it is presented to a set-top box.
 20. A method of correlating dynamic and static datasets sharing at least one common characteristic and having an assumed relationship, and using such correlations to determine rule systems between the sets, comprising the steps of: selecting subsets of said datasets sharing a common characteristic; expressing the assumed relationship as a mathematical assumption; defining an error function which describes the two datasets in terms of said mathematical assumption; performing fitting procedures to account for errors in the assumed relationship; and performing fitting procedures which account for errors in the definition of the common subsets.
 21. The method of claim 20, in which said dynamic data corresponds to set-top box event data.
 22. The method of claim 21, in which said static data corresponds to demographic data.
 23. The method of claim 22, in which correlations are drawn between set-top box event data and demographic data to determine the relationship of demographics to content viewership.
 24. A method of testing assumptions pertaining to relationships between two disparate datasets sharing at least one common aspect, comprising the steps of: entering such assumptions through a user interface; selecting sample data from a first dataset; determining correlations between said selected data and data stored in a second dataset; and establishing assumption validity based on such correlations.
 25. A method of determining individual characteristics by correlating dynamic and static datasets sharing at least one common characteristic and having an assumed relationship, comprising the steps of: selecting subsets of said datasets sharing a common characteristic; expressing the assumed relationship as a mathematical assumption; defining an error function which describes the two datasets in terms of said mathematical assumption; performing fitting procedures to account for errors in the assumed relationship; storing such correlations in an individual-specific array; and iteratively repeating this process.
 26. The method of claim 25, in which said dynamic dataset corresponds to set-top box data.
 27. The method of claim 26, in which said static dataset corresponds to demographic data.
 28. The method of claim 27, in which said individual-specific data corresponds to a set-top box identification number or other privacy-compliant identification number.
 29. The method of claim 28, in which an IDM algorithm determines said correlations.
 30. A method of dynamically determining the demographic identity of an individual operating a set-top box, comprising the steps of: monitoring set-top box events for a plurality of set-top boxes; correlating set-top box events with demographic characteristics; applying IDM calculation techniques to determine probabilities for demographic characteristic and set-top box event dataset correlations; ascribing demographic characteristic probabilities to each set-top box over time based on observed set-top box events and their relationship to such IDM probabilities; evaluating such ascribed demographic characteristic probabilities over time through statistical analysis; fitting probabilities ascribed to demographic characteristics to statistically determine the most likely set of constant dataset possibilities for each set-top box; and, fitting set-top box possibility sets to IDM probability sets for a set-top box event.
 31. The method for determining the demographic identities of individuals in a home, business, or other location containing a set-top box according to the method of claim 30, further comprising the steps of: storing said demographic identities in an array over time; and applying statistical analyses to said array to determine predominant demographic identities for a given set-top box.
 32. A system for directing content to a specific demographic group, comprising: an array identifying demographic identities associated with set-top boxes; a means for entering a demographic group to be targeted; a means for entering the content, or a reference to such content, to be directed to a demographic group; a means for entering times and other properties indicating a preferred content delivery method; and a means of delivering content to a set-top box corresponding to requested demographic information.
 33. The system of claim 32 in which said content refers to advertising.
 34. A system for directing content to set-top boxes exhibiting a behavior or pattern of behaviors when a specified content type is presented, comprising: a set of set-top box events with specific time recordings for each event; a set of content properties; a means for correlating set-top box events to content properties; a means for entering desired set-top box event/content property correlations; a means for delivering content to those set-top boxes corresponding to said set-top box event/content property correlations.
 35. The system of claim 34 in which said content refers to advertising.
 36. A method of determining the effect of content attributes on content ratings, comprising the steps of: obtaining content attributes from embedded content information or from external sources; recording set-top box events as content is experienced; correlating set-top box events to content attributes; and, analyzing such correlations over time to determine the effect of content attributes on content ratings.
 37. The method of claim 36 in which said content attributes include times at which various content attributes are presented to a set-top box, thereby allowing the present invention to provide detailed correlations between such attributes and set-top box events.
 38. A method of determining the effect of content attributes on content ratings for a specific demographic group, comprising the steps of: obtaining content attributes from embedded content information or from external sources; recording set-top box events as content is experienced; correlating set-top box events to content attributes; correlating set-top box events and content attributes to demographic characteristics for each set-top box; and analyzing such correlations over time to determine the effect of content attributes on content ratings for specific demographic groups.
 39. The method of claim 38 in which said content attributes include times at which various content attributes are presented to a set-top box, thereby allowing the present invention to provide detailed correlations between set-top box events, set-top box demographics, and content attributes.
 40. A method of creating new content based on previously experienced content and content ratings, comprising the steps of: obtaining content attributes from embedded content information or from external sources; recording set-top box events as content is experienced; correlating set-top box events to content attributes; analyzing such correlations over time to determine the effect of content attributes on content ratings; and analyzing the effect of content attribute order on content ratings; and determining a preferred content attribute set and content attribute presentation order.
 41. The method of claim 40 in which said content attributes include times at which various content attributes are presented to a set-top box.
 42. A method of creating new content based on previously experienced content and content ratings, where such new content is directed toward a demographic group, comprising the steps of: obtaining content attributes from embedded content information or from external sources; recording set-top box events as content is experienced; correlating set-top box events to content attributes; correlating set-top box events and content attributes to demographic characteristics; analyzing such correlations over time to determine the effect of content attributes on content ratings for a given demographic group; and analyzing the effect of content attribute order on content ratings for a given demographic group; and determining a preferred content attribute set and content attribute presentation order for a given demographic group.
 43. A system for predicting future events based on a proposed dataset, consisting of: a dataset of past events; a known dataset sharing at least one attribute with said dataset of past events, and with substantially similar attributes to said proposed dataset; a means of correlating said dataset of past events with said known dataset to form a new dataset; and, a means of correlating said new dataset to said proposed dataset.
 44. The system of claim 43, where said dataset of past events consists of set-top box event data.
 45. The system of claim 44, in which said known dataset consists of sales figures.
 46. The system of claim 44, where said known dataset consists of content attributes and content presentation data.
 47. The system of claim 46, where said proposed dataset consists of a set of content attributes for proposed content.
 48. A method of predicting future events given a proposed dataset, comprising the steps of: monitoring past events; correlating said past events with a dataset sharing at least one attribute with said past events, and with a substantially similar structure to the proposed dataset, the results of which are stored in an array; correlating said array with said proposed dataset; and reporting the results of said array/proposed dataset correlations as a prediction of future events.
 49. The method of claim 48, in which said past events include set-top box events.
 50. The method of claim 49, in which said proposed dataset substantially consists of proposed content attributes.
 51. The method of claim 50, in which said dataset includes previously presented content attributes.
 52. The method of claim 51, in which said dataset consists of sales figures.
 53. A system of predicting future events for a given demographic segment, comprising: a dataset of past events; a demographic dataset sharing at least one attribute with said dataset of past events; a means of correlating said dataset of past events with said demographic dataset and storing the result in an array; a known dataset sharing at least one attribute with said demographic dataset, and with substantially similar attributes to said proposed dataset; a means of correlating said array with said known dataset to form a new dataset; and, a means of correlating said new dataset to said proposed dataset.
 54. The system of claim 53 in which said dataset of past events corresponds to set-top box data.
 55. The system of claim 54 in which said demographic dataset shares a zip code or other geographic identifier with said set-top box data.
 56. The system of claim 55 in which said known dataset shares a zip code or other geographic identifier with said array.
 57. The system of claim 56 in which said proposed dataset is comprised of proposed content and attributes corresponding thereto.
 58. A method of predicting future events for a given demographic based on a proposed dataset, comprising the steps of: monitoring past events; correlating said past events with a demographic dataset and storing the result in an array; correlating said array with a dataset sharing at least one attribute with said array, and with a substantially similar structure to the proposed dataset, the results of which are stored in an additional dataset; correlating said additional dataset with said proposed dataset; and reporting the results of such correlations as a prediction of future events.
 59. The method of claim 58, in which said past events include set-top box events.
 60. The method of claim 59, in which said demographic dataset and said set-top box event dataset both contain zip code attributes.
 61. The method of claim 60, in which said proposed dataset substantially consists of proposed content attributes.
 62. The method of claim 61, in which said dataset includes previously presented content attributes.
 63. A privacy-compliant data collection and data correlation system comprising: a means of collecting individual-specific behavior data without knowing individual-specific demographic information pertaining to the individual about whom such data is collected; a means of accessing demographic data for the region in which the individual resides; and a means of correlating such individual-specific data with such demographic data to determine the demographic identity of each individual about whom data is collected.
 64. The privacy-compliant data collection and data correlation system of claim 63, wherein said individual-specific behavior data collection means is comprised of a set-top box.
 65. A method of predicting behaviors of non-sampled demographic specifications based on sampled demographic specifications of a given level comprising the steps of: monitoring past behavior and correlating such behavior with demographic characteristics monitored; breaking a non-sampled demographic specification into sub-specifications for which sample data has been collected; establishing the statistical effects of various rules on each sub-specification and those characterizations comprising them; and statistically predicting non-sample behaviors based on such effects.
 66. The method of claim 65, further comprising the steps of: observing correlations between behaviors of sampled demographic specifications or sub-specifications and behaviors of non-sampled demographic specifications, and inferring behaviors of such non-sampled demographic specifications from such correlations, such that predicted or observed sampled demographic specification behaviors may be reported as non-sampled demographic specification behaviors within a determinable level of accuracy.
 67. A method of reducing the effect of sampling error and sample bias on data correlations determined between a dynamic dataset and a static dataset based on assumptions about the relationships between such data, such as: creating equations to express such assumptions; determining error functions which can assist in calculating values for each unknown variable in such equations; creating a transformable matrix based on such functions; inverting said matrix to apply a least-squares approach fitting method to the underlying data; normalizing the results of said least-squares fit; calculating Pearson-r correlations for such normalized results; calculating aspect representation indices for each subset of data within said static dataset; determining assumption validities for assumptions used as a basis for this process; and combining said correlations, said aspect representation indices, and said assumption validities to create a set of data correlations and corresponding confidence intervals.
 68. The method of claim 67 in which said dynamic dataset represents set-top box event data.
 69. The method of claim 68 in which said static dataset represents demographic information.
 70. The method of claim 69 in which the assumption used to relate said set-top box event data with said demographic information is the demographic assumption.
 71. The method of claim 20, in which said fitting procedures include applying additional assumptions to provide missing correlation values.
 72. A method of increasing correlation result dataset specificity by reducing possibilities, consisting of the steps: calculating correlation result dataset characterization values which fall within a predetermined confidence limit using aspect representation indices, inverse demographic matrices, recombination matrices, and specification similarity matrices; creating a matrix of such values for all demographic characterizations for each method used; utilizing mathematical expressions of the requirement of consistency for distinct value ranges for identical characterizations in the separate matrices, reducing each range for a given characterization to the greatest possible extent within a predetermined confidence interval; thus producing one matrix with one value range for each characterization; possibly transforming value ranges for all characterizations within said matrix to the same statistical confidence; iteratively reducing all ranges to the greatest possible extent by utilizing both mathematical expressions of the requirement of consistency among all value ranges in said matrix as well as constraints given by actual characterization population numbers; and adjusting the statistical confidence if necessary to allow for further value range reduction past the point of useful iteration at a previous statistical confidence.
 73. The method of claim 72 in which said dataset correlations result from correlations of set-top box event data and demographic data.
 74. The method of claim 72 in which said dataset correlations result from correlations of demographic data and sales data.
 75. The method of claim 72 in which said dataset correlations result from correlations of set-top box data and sales data.
 76. A method of fitting by convergence and similarity between a static dataset and a dynamic dataset, comprising the steps of: defining subsets of each dataset; determining correlations between such datasets; performing a time-based analysis of group representations and additional correlations within said correlations; assigning weights to such representations and additional correlations; and, applying such weights and values to determine undefined correlation dataset values.
 77. The method of claim 76 in which said dynamic dataset represents set-top box data.
 78. The method of claim 77, in which said static dataset represents demographic data.
 79. The method of claim 78 in which said unidentified correlation dataset values represent non-sampled demographic specifications.
 80. A method of invalidating set-top box events, comprising the steps of: monitoring set-top box events; storing such events in an array; calculating trends in such events; invalidating set-top box events which deviate in a statistically significant manner from observed set-top box event trends, or which match previously defined invalid set-top box events; placing such invalidated set-top box events in an array; and calculating trends in such invalidated set-top box events such that some long-term trends may be revalidated, and to identify new set-top box event categories to be ignored.
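
By way of illustration only, the invalidation steps recited in claim 80 might be sketched as follows. The claims do not specify a particular statistic, so this sketch assumes a simple z-score test on dwell time as the measure of "statistically significant" deviation. The `Event` fields, the function names `split_events` and `recurring_invalid_kinds`, and the default threshold of 3.0 are hypothetical and are not taken from the specification.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Event:
    box_id: str     # anonymous set-top box identifier (hypothetical field)
    kind: str       # event category, e.g. "channel_change" (hypothetical label)
    dwell_s: float  # seconds elapsed before the next event (assumed metric)


def split_events(events, known_invalid_kinds, z_limit=3.0):
    """Split events into (valid, invalid) lists.

    An event is invalidated if its kind matches a previously defined
    invalid kind, or if its dwell time deviates from the overall mean
    by more than z_limit population standard deviations (an assumed test).
    """
    dwell = [e.dwell_s for e in events]
    mu, sigma = mean(dwell), pstdev(dwell)
    valid, invalid = [], []
    for e in events:
        z = abs(e.dwell_s - mu) / sigma if sigma > 0 else 0.0
        if e.kind in known_invalid_kinds or z > z_limit:
            invalid.append(e)
        else:
            valid.append(e)
    return valid, invalid


def recurring_invalid_kinds(invalid, min_count):
    """Return kinds that recur at least min_count times among invalidated
    events; these are candidates for revalidation or for a new
    ignore category."""
    counts = Counter(e.kind for e in invalid)
    return {kind for kind, n in counts.items() if n >= min_count}
```

In such a sketch the invalidated events are kept in their own list rather than discarded, so that a later pass over that list can surface recurring categories for review.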